Generalizable Underwater Acoustic Target Recognition Using Feature Extraction Module of Neural Network

Abstract: The underwater acoustic target signal is affected by factors such as the underwater environment and the ship's working conditions, which makes the generalization of the recognition model essential. This study is devoted to improving the generalization of recognition models: it proposes a feature extraction module based on a neural network and time-frequency analysis, and it validates the feasibility of model-based transfer learning. A network-based filter built on one-dimensional convolution is constructed according to the calculation mode of the finite impulse response filter. An attention-based model is constructed from convolutional network components and fully connected components. The attention-based network uses convolution components to perform the Fourier transform and feeds the optimization gradient of a specific task back to the network-based filter. The network-based filter is designed to filter the observed signal for adaptive perception, and the attention-based model is constructed to extract the time-frequency features of the signal. In addition, model-based transfer learning is used to further improve the model's performance. Experiments show that the model can perceive the frequency domain features of underwater acoustic targets, and the proposed method demonstrates competitive performance in various classification tasks on real data, especially those requiring high generalizability.


Introduction
Underwater acoustic target passive recognition is a technology that recognizes the target type through a sonar system based on the target's radiated noise. Generally speaking, different ship targets have different hull structures, mechanical vibration characteristics and propeller structures. These factors lead to differences in radiated noise. Furthermore, due to differences in ship working conditions and the interference of time-varying and space-varying underwater acoustic channels and ocean noise, the ship's radiated noise collected by hydrophones is complicated and fuzzy. This complexity and fuzziness increase the difficulty of underwater acoustic target recognition. Therefore, improving the underwater acoustic target recognition performance of a sonar system can be difficult.
Methods based on artificial intelligence enable complex data modelling and are suitable for algorithm design in complex scenes. Several researchers have applied artificial intelligence to underwater acoustic target recognition. Nowadays, ship target classification and recognition methods based on artificial intelligence fall mainly into two kinds: traditional machine learning methods and deep learning methods. Classical machine learning methods include feature extraction and classifier design. Researchers extract various features from the ship's radiated noise signal based on traditional methods, such as waveform structure features [1][2][3], frequency characteristics and time-frequency analysis [4][5][6][7][8][9][10][11][12], and auditory perception features [13][14][15][16][17]. The extracted features are then input into traditional machine learning classifiers, such as classifiers based on statistical analysis or on simple neural networks [18,19]. Although traditional machine learning methods can complete some recognition tasks, their recognition accuracy is limited by the complex underwater environment, the diversity of ship working conditions, feature extraction that relies on excessive manual intervention, and simple classifier design. To solve these problems, researchers have applied deep learning to underwater acoustic target recognition. Generally speaking, an underwater acoustic target recognition algorithm based on deep learning first inputs primary features or original signals into a deep neural network. The neural network then learns from a large amount of data to generate high-level embedded representations. Finally, these embedded high-level representations are used for classification. In recent years, a large number of studies have been conducted on neural network structures [20][21][22][23] and learning strategies [24,25] based on simple feature input or raw signal input. Some studies have examined the performance and integration form of multiple feature combinations [26][27][28] based on the combined-feature method. Furthermore, some methods study data enhancement [29] and data generation [30] for deep network training. Deep learning is a data-driven technology. It learns feature extraction and target representation from training data, avoiding the inefficiency and information loss of manual feature extraction. However, due to the lack of underwater acoustic target data, many problems remain in applying deep learning to underwater acoustic target recognition tasks, and these are worthy of further study. When deep learning is used to extract features from limited data, the neural network may pay too much attention to information or noise that is irrelevant to the target features but related to dataset attributes, so the extracted features may apply only to the training dataset and lack interpretability. Moreover, the random initialization of the neural network parameters introduces many uncertainties, which aggravates over-fitting on the limited amount of data. These defects weaken the generalization ability of underwater acoustic target recognition algorithms based on deep learning and limit their practicality. For example, with a limited amount of training data, the recognition of underwater acoustic targets from the same voyage shows high accuracy, but the accuracy for targets from different voyages is seriously reduced. Similarly, many fields cannot directly use deep learning for modeling because of data attributes. Researchers have tried to use domain knowledge or optimization-based strategies to assist modeling and have made some progress [31][32][33][34].
In this paper, to solve the problem of generalization modeling using deep learning, the design of an interpretable algorithm and the deployment of a transfer learning method are considered. In terms of interpretable algorithm design, instead of piling up network structures, this paper proposes a feature extraction module based on a neural network, which integrates key technologies of signal processing and neural networks, such as digital filtering, time-frequency analysis and attention mechanisms. Using neural network learning, we try to optimize the design of intelligent algorithms from an interpretable perspective. In particular, the neural network-based feature extraction module receives a one-dimensional time-domain signal and applies a neural network to realize digital filtering and time-frequency feature extraction. The feature extraction module mines the frequency band suitable for the current classification task from the signal, and the frequency response of the neural network can be output in real time. The deployment of transfer learning is inspired by image recognition, where researchers train a neural network model on the large-scale ImageNet dataset [35] and transfer the trained model to downstream image processing tasks, achieving good results. However, in the field of audio pattern recognition, the performance of audio pattern recognition systems pre-trained on large-scale datasets is still an open problem. For underwater acoustic target recognition, the feasibility of the pre-trained model needs to be discussed and verified, especially when the pre-training task is not related to underwater acoustic target recognition. This paper verifies the underwater sound recognition performance of a model pre-trained on large-scale audio pattern recognition data.
The rest of this paper is organized into four parts. Section 2 describes the design of the feature extraction module of an attention-based neural network (FEM-ATNN), including the design of a time-domain filter based on a single convolution kernel and the design of the Fourier transform module based on the attention mechanism. Section 3 describes the selection of the network model and the validation of model-based transfer learning. Section 4 presents various experiments and discusses the feasibility and effectiveness of the proposed method in underwater acoustic target classification tasks. Finally, Section 5 summarizes the paper.

Feature Extraction Module of Attention-Based Neural Network
Inspired by time-frequency analysis and the characteristics of neural networks, this paper constructs a feature extraction module based on neural networks, designing a set of attention-based digital filters to perform time-domain filtering and extract time-frequency features. The proposed module processes raw signals end-to-end, adapts its parameters to the specific task during optimization, and improves the generalization of the recognition algorithm. Firstly, a time-domain filter based on a single one-dimensional convolution kernel (1D-CK) is proposed, called a time-domain filter with convolution kernel (TFCK). TFCK can sense the frequency response suited to a specific classification task and implements the equivalent function of a finite impulse response (FIR) filter with linear phase. Secondly, a time-frequency analysis module of an attention-based neural network (TFA-ATNN) is realized using a fully connected network and a set of 1D-CKs. The Fourier transform component is constructed from 1D-CKs. A set of these components is used to construct the time-frequency transposition layer and extract time-frequency information. The information fusion layer fuses the outputs of the time-frequency transposition layer, and the fusion results are sent to the attention generator to extract features. A brief description of FEM-ATNN is shown in Figure 1.
Figure 1. A brief description of the method in this paper: first, TFCK is used to filter the raw signal; second, a set of convolution kernels is adopted to extract time-frequency features from the filtered signal; finally, an attention-based network is applied to the features.

Time-Domain Filter with Convolution Kernel
The FIR filter can retain the target signal and suppress interference. It weights the successive input samples and then obtains the filtered signal by accumulation. An Nth-order FIR filter performs N multiplications and N − 1 accumulations to complete one filtering operation. This process is expressed as follows:
$$y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k) \qquad (1)$$
where $y(n)$ and $x(n-k)$ represent the output and input of the filter, $h(k)$ denotes the filtering coefficient, and $N$ denotes the order of the filter. A one-dimensional convolution operation with a convolution kernel of odd length can be expressed as:
$$y(n) = f\left(\sum_{k=0}^{N-1} w(k)\, x\left(sn + \frac{N-1}{2} - k\right) + \beta\right) \qquad (2)$$
where $y(n)$ and $x(sn + (N-1)/2 - k)$ denote the output and input of the 1D-CK, $w(k)$ denotes the weight of the 1D-CK, $N$ denotes the kernel size of the 1D-CK, $s$ denotes the stride of the convolutional layer, $\beta$ denotes the bias, and $f(\cdot)$ denotes the activation function.
Therefore, an FIR filter can be built from a single 1D-CK whose bias and activation function are removed. When the stride and kernel size of the convolution are set to 1 and N, the 1D-CK can be treated as an Nth-order FIR filter with a delay of (N − 1)/2. For the classification task, the FIR filter implemented by the 1D-CK, named TFCK, can act either as an adaptive filter, whose parameters are optimized by gradient descent, or as a fixed-parameter filter, which filters the raw signal with fixed optimal parameters. TFCK is shown in Figure 2.
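The equivalence between the FIR filter of Equation (1) and a bias-free, activation-free 1D-CK in the form of Equation (2) can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation; the function names `fir_filter` and `conv1d_centered` are hypothetical, and zero padding at the signal edges is an assumption. Note that convolution layers in some deep learning frameworks compute cross-correlation, in which case the kernel would be applied in reversed order; for a symmetric linear-phase kernel this makes no difference.

```python
def fir_filter(x, h):
    """Causal FIR filter: y(n) = sum_k h(k) * x(n - k), with zero initial state."""
    N = len(h)
    return [
        sum(h[k] * x[n - k] for k in range(N) if n - k >= 0)
        for n in range(len(x))
    ]


def conv1d_centered(x, w, stride=1):
    """1D-CK without bias or activation: y(n) = sum_k w(k) * x(s*n + (N-1)/2 - k).

    Samples outside the signal are treated as zero (an assumed padding choice),
    and the kernel length N must be odd.
    """
    N = len(w)
    assert N % 2 == 1, "kernel length must be odd"
    half = (N - 1) // 2
    n_out = (len(x) - 1) // stride + 1
    y = []
    for n in range(n_out):
        acc = 0.0
        for k in range(N):
            idx = stride * n + half - k
            if 0 <= idx < len(x):
                acc += w[k] * x[idx]
        y.append(acc)
    return y


# Illustrative data: a short signal and a symmetric 3-tap kernel.
signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5, 0.0, 4.0]
kernel = [0.25, 0.5, 0.25]
delay = (len(kernel) - 1) // 2

fir_out = fir_filter(signal, kernel)
conv_out = conv1d_centered(signal, kernel, stride=1)

# With stride 1 the two outputs agree up to a shift of (N - 1) / 2 samples.
for n in range(len(signal) - delay):
    assert abs(conv_out[n] - fir_out[n + delay]) < 1e-12
```

With a stride larger than 1, the same kernel filters and downsamples in one step: output sample m of the strided version equals output sample s·m of the stride-1 version.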
The parameter design method of TFCK is the same as that of the traditional FIR filter. For a specific underwater acoustic target recognition task, TFCK can learn the filter's frequency response by gradient descent. In other words, it can automatically search the data for the frequency range suited to the current classification task. The adjustment strategy of TFCK includes two stages: the pre-training stage and the training stage. In the pre-training stage, all parameters of the 1D-CK are learned adaptively for a specific classification task. By observing the feedback of the neural network, we can analyze the inner workings and behavior of the model, strengthen the extraction of high-value information and suppress the network's perception of noise. In the training stage, we adjust and fix the parameters of the 1D-CK according to the feedback obtained in the pre-training stage, which improves the network's ability to suppress low-value information and makes it easier for the subsequent network to learn generalized embedding features. The two stages are shown in Figure 3.
In the ShipsEar classification task, TFCK is very sensitive to low-frequency information at the initial stage of training. The training process gradually amplifies the importance of low-frequency information, and the neural network finds several peaks with similar intervals. At the last stage of the training process, the perception of the neural network finally stabilizes within a range, and the information that is redundant for the current network architecture and classification task is fed back. The high-frequency information of ship radiated noise is seriously attenuated by the underwater acoustic channel, whereas the low-frequency information propagates further in the underwater environment. Usually, the identifiable information in the ship radiated noise received by the hydrophone is concentrated in the low-frequency part of the raw signal. This result means that the knowledge TFCK learns from the data is consistent with the understanding of experts in underwater acoustic target recognition and with the objective laws of physics. According to the result, the information that differentiates the categories in the dataset is mainly concentrated below 600 Hz. Therefore, the parameters of TFCK can be optimized to complete the classification task better than before. The stride s in Equation (2) is set to 4, and the 1D-CK is set according to a traditional low-pass FIR filter with a 1500 Hz cutoff frequency.
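To illustrate the fixed-parameter mode described above, the sketch below designs a linear-phase low-pass kernel with a 1500 Hz cutoff by the standard windowed-sinc method and evaluates its magnitude response. It is only an illustration under stated assumptions: the sampling rate (16 kHz), the kernel length (101 taps) and the Hamming window are hypothetical choices that the text does not specify, and `lowpass_fir` and `magnitude_response` are illustrative names.

```python
import cmath
import math


def lowpass_fir(num_taps, cutoff_hz, fs_hz):
    """Windowed-sinc low-pass FIR coefficients (Hamming window), unit DC gain."""
    assert num_taps % 2 == 1, "odd length gives a symmetric, linear-phase kernel"
    fc = cutoff_hz / fs_hz  # normalized cutoff in cycles per sample
    mid = (num_taps - 1) // 2
    h = []
    for n in range(num_taps):
        m = n - mid
        # Impulse response of the ideal low-pass filter, centered at `mid`.
        ideal = 2 * fc if m == 0 else math.sin(2 * math.pi * fc * m) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        h.append(ideal * window)
    total = sum(h)
    return [c / total for c in h]


def magnitude_response(h, freq_hz, fs_hz):
    """|H(f)| evaluated from H(f) = sum_k h(k) * exp(-j * 2 * pi * f * k / fs)."""
    return abs(sum(c * cmath.exp(-2j * math.pi * freq_hz * k / fs_hz)
                   for k, c in enumerate(h)))


FS_HZ = 16000.0  # assumed sampling rate, not taken from the text
taps = lowpass_fir(num_taps=101, cutoff_hz=1500.0, fs_hz=FS_HZ)

for f in (100.0, 600.0, 1500.0, 4000.0):
    print(f"{f:7.1f} Hz -> gain {magnitude_response(taps, f, FS_HZ):.4f}")
```

The same `magnitude_response` function can be applied to a learned kernel to inspect which frequency range the adaptive TFCK has emphasized after pre-training.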

Time-Frequency Analysis Module of Attention-Based Neural Network
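As a classical time-frequency analysis method, the short-time Fourier transform (STFT) can reflect how the frequency content of ship radiated noise changes over time. Firstly, TFA-ATNN uses a set of 1D-CKs to embed the Fourier transform into the neural network, so that the neural network is able to extract time-frequency features. Secondly, an attention mechanism is applied to improve the neural network's ability to perceive frequency. The discrete Fourier transform (DFT) can decompose frequency components from complex time-domain waveforms and is an important method for signal analysis and processing. In this paper, the DFT is realized based on 1D-CKs. The DFT can be expressed as Equations (3) and (4):
$$X(k) = \mathrm{DFT}[x(n)] = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi}{N}kn}, \quad k = 0, 1, \ldots, N-1 \qquad (3)$$
$$x(n) = \mathrm{IDFT}[X(k)] = \frac{1}{N}\sum_{k=0}^{N-1} X(k)\, e^{j\frac{2\pi}{N}kn}, \quad n = 0, 1, \ldots, N-1 \qquad (4)$$
where $x(n)$ denotes the time-domain signal sequence, and $\mathrm{DFT}[\cdot]$ and $\mathrm{IDFT}[\cdot]$ denote the discrete Fourier transform and its inverse. In order to realize the discrete Fourier transform with a convolutional neural layer, Equation (3) is expressed in matrix form, as follows:
$$\begin{bmatrix} X(0) \\ X(1) \\ \vdots \\ X(N-1) \end{bmatrix} = \begin{bmatrix} W^{0} & W^{0} & \cdots & W^{0} \\ W^{0} & W^{1} & \cdots & W^{N-1} \\ \vdots & \vdots & \ddots & \vdots \\ W^{0} & W^{N-1} & \cdots & W^{(N-1)(N-1)} \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ \vdots \\ x(N-1) \end{bmatrix} \qquad (5)$$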
where $W$ is used to replace $e^{-j\frac{2\pi}{N}}$. In addition, according to the Euler formula, Equation (3) can be decomposed into a real component and an imaginary component. The decomposition is as follows:
$$X(k) = \sum_{n=0}^{N-1} x(n)\cos\left(\frac{2\pi}{N}kn\right) - j\sum_{n=0}^{N-1} x(n)\sin\left(\frac{2\pi}{N}kn\right) \qquad (6)$$
The typical convolutional neural network layer is represented by Equation (7):
$$x_j^{l} = f\left(\sum_{i \in D^{l-1}} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\right) \qquad (7)$$
where $x_j^{l}$ represents the output feature, $k_{ij}^{l}$ represents the convolution kernel, $D^{l-1}$ represents the feature set, $b_j^{l}$ represents the bias, $f(\cdot)$ denotes the activation function, $j$ denotes the convolution kernel number, $l$ denotes the layer number, and $*$ is the convolution operation of the convolutional neural network. In particular, for a one-dimensional convolution layer that takes only a single-channel time series as input, when the input signal and the convolution kernel have equal length and the bias, activation function and padding are removed, Equation (7) can be transformed into Equation (8):
$$x_j = \sum_{n=0}^{N-1} x'(n)\, w_j(n) \qquad (8)$$
where $x_j$ represents the output of the $j$-th convolution kernel, $x'(n)$ represents the input signal, and $w_j(n)$ represents the weight of the convolution kernel. In order to realize the Fourier transform, we set two groups of convolution kernels to calculate the imaginary component and the real component of the Fourier transform, respectively. The weights of these two groups of convolution kernels can be fixed according to the sine basis functions and cosine basis functions in Equation (6). The number of convolution kernels is determined by the number of points in the Fourier transform: it equals the number of points divided by two, plus one. Therefore, a one-dimensional convolution layer without bias, activation function and padding can realize the Fourier transform operation. Furthermore, STFTs with different time and frequency resolutions can be realized by adjusting the size, stride and basis functions of the convolution kernels, as shown in Figure 5. This module is called the basic STFT module based on the convolutional neural network (BSTFT-CNN). In addition, in order to attenuate sidelobe height and weaken the impact of spectrum leakage, a Hanning window is applied to the convolution kernels. Specifically, the Hanning window with length M is expressed as Equation (9):
$$w(n) = 0.5\left[1 - \cos\left(\frac{2\pi n}{M-1}\right)\right], \quad 0 \le n \le M-1 \qquad (9)$$
Appl. Sci. 2022, 12, x FOR PEER REVIEW 7 of 18
Since the parameters of the convolution kernels in the BSTFT-CNN are initialized by the standard sine and cosine basis functions, the network after the BSTFT-CNN learns the frequency components extracted by the BSTFT-CNN indiscriminately. However, ship radiated noise has obvious frequency domain characteristics. This paper therefore embeds an attention mechanism into the BSTFT-CNN to construct a convolutional neural network time-frequency feature extraction module with an attention mechanism, which is called TFA-ATNN, as shown in Figure 6. The TFA-ATNN adopts the fully connected layer with shared parameters (FCSP), with the aim that the neural network learns to automatically fuse the stable frequency components in the input signal and achieves a more stable recognition effect than before.
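The BSTFT-CNN construction described above (two groups of fixed cosine and sine kernels, an optional Hanning window, and a stride that sets the hop between frames) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and `direct_dft` is included only as a reference for checking the kernels against Equation (3).

```python
import cmath
import math


def hann(M):
    """Hanning window of length M: w(n) = 0.5 * (1 - cos(2*pi*n / (M - 1)))."""
    return [0.5 * (1 - math.cos(2 * math.pi * n / (M - 1))) for n in range(M)]


def stft_kernels(n_fft, window=None):
    """Fixed kernels for bins k = 0 .. n_fft/2, i.e. n_fft/2 + 1 kernels per group."""
    win = window if window is not None else [1.0] * n_fft
    n_bins = n_fft // 2 + 1
    cos_k = [[win[n] * math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in range(n_bins)]
    # The minus sign makes this group return the imaginary part of X(k) directly.
    sin_k = [[-win[n] * math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in range(n_bins)]
    return cos_k, sin_k


def conv_stft(x, n_fft, hop, window=None):
    """Slide the kernels over x with stride `hop`; no padding, bias, or activation.

    Returns (real, imag), each a list of frames with n_fft/2 + 1 values per frame.
    """
    cos_k, sin_k = stft_kernels(n_fft, window)
    real, imag = [], []
    for start in range(0, len(x) - n_fft + 1, hop):
        frame = x[start:start + n_fft]
        real.append([sum(f * w for f, w in zip(frame, k)) for k in cos_k])
        imag.append([sum(f * w for f, w in zip(frame, k)) for k in sin_k])
    return real, imag


def direct_dft(frame):
    """Reference DFT of one frame, used only to check the kernels."""
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]


# A pure tone at bin 4 of a 16-point transform, framed with a hop of 8 samples.
tone = [math.cos(2 * math.pi * 4 * n / 16) for n in range(32)]
real, imag = conv_stft(tone, n_fft=16, hop=8)
reference = direct_dft(tone[:16])
for k in range(9):
    assert abs(real[0][k] - reference[k].real) < 1e-9
    assert abs(imag[0][k] - reference[k].imag) < 1e-9
```

Passing `window=hann(n_fft)` multiplies each kernel by the window, and changing `n_fft` and `hop` trades frequency resolution against time resolution.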
The TFA-ATNN is divided into three stages. In stage 1, the TFA-ATNN uses two FCSPs to learn the information that stably exists in the imaginary and real components, and combines these components with the phase spectrum into a combined feature. In stage 2, an FCSP is used to fuse the combined feature. Feature fusion generates two attention maps, which are used to enhance the imaginary and real components, respectively. In the last stage, a learnable factor is used to combine the enhanced spectra. The combined result is the output feature that is sent to the network for embedding and extraction.
Specifically, the size of each component output by the BSTFT-CNN is (T, NF), where T is the number of time points and NF is the number of frequency points. To make it easier for the model to capture the stable frequency distribution, the FCSP is proposed to constrain the relationship between frequency components. Its operation is expressed as Equation (10), where A is one of the output features of the BSTFT-CNN, which include the features of the real and imaginary components. F is the parameter matrix of the FCSP; its size depends on the input and output and is (NF, NF) in stage 1. B is the matrix output by the FCSP. The subscript i denotes the imaginary component, the subscript r denotes the real component, and the subscript 1 denotes stage 1. The number of neural units in the single layer of the FCSP is equal to the number of frequency points.
In stage 2, the TFA-ATNN combines the imaginary component, the real component and the phase component. This differs from classical attention mechanisms such as SENet [37] and CBAM [38], which obtain attention information from the current layer, and it ensures that the model has sufficient information to extract the relationship between frequency components. The feature matrices are fused by concatenation, and the fused matrix is called C. C is then input into the FCSP of stage 2, as in Equation (11), where f(·) is the gate function, here the sigmoid. B_{i,2} is multiplied element-wise by (1 + A_i) to obtain D_i, called the enhanced imaginary component, and B_{r,2} is multiplied element-wise by (1 + A_r) to obtain D_r, called the enhanced real component.
Finally, a learnable weight factor p is used to fuse the enhanced spectra of the real and imaginary components, which allows the neural network to adapt the weights of the attention mechanism to the task and parameters. Equation (12) describes the enhancement process, where M denotes the features extracted by the TFA-ATNN. Except for the parameters in the BSTFT-CNN, all parameters in the TFA-ATNN are adjusted during the training stage. These adjustable parameters are presented in Table 1.
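The three stages described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. Several details are assumptions made only for illustration: the FCSP is modelled as a bias-free matrix product B = A F, the phase is computed from the unprocessed components with atan2, the concatenation is along the frequency axis so that the stage 2 matrices have size (3·NF, NF), and the final fusion is taken to be the convex combination p·D_r + (1 − p)·D_i. All function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_matrix(rows, cols, rng):
    """Small random matrix used as stand-in data or weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def tfa_attention(a_r, a_i, f_r1, f_i1, f_r2, f_i2, p):
    """Sketch of the three-stage attention on real/imaginary features of size (T, NF)."""
    # Stage 1: one FCSP per component (assumed here to be B = A F).
    b_r1 = matmul(a_r, f_r1)
    b_i1 = matmul(a_i, f_i1)
    # Phase of each time-frequency point (assumed to come from the raw components).
    phase = [[math.atan2(im, re) for re, im in zip(row_r, row_i)]
             for row_r, row_i in zip(a_r, a_i)]
    # Combined feature C: concatenation along the frequency axis, size (T, 3*NF).
    c = [br + bi + ph for br, bi, ph in zip(b_r1, b_i1, phase)]
    # Stage 2: FCSP followed by a sigmoid gate gives one attention map per component.
    gate_r = [[sigmoid(v) for v in row] for row in matmul(c, f_r2)]
    gate_i = [[sigmoid(v) for v in row] for row in matmul(c, f_i2)]
    # Enhancement: element-wise product of each attention map with (1 + A).
    d_r = [[g * (1.0 + a) for g, a in zip(gr, ar)] for gr, ar in zip(gate_r, a_r)]
    d_i = [[g * (1.0 + a) for g, a in zip(gi, ai)] for gi, ai in zip(gate_i, a_i)]
    # Stage 3: fuse the enhanced spectra with the learnable factor p
    # (a convex combination is assumed; the paper's Equation (12) may differ).
    return [[p * x + (1.0 - p) * y for x, y in zip(dr, di)]
            for dr, di in zip(d_r, d_i)]


rng = random.Random(0)
T, NF = 4, 3
features = tfa_attention(
    rand_matrix(T, NF, rng), rand_matrix(T, NF, rng),          # A_r, A_i
    rand_matrix(NF, NF, rng), rand_matrix(NF, NF, rng),        # stage 1 FCSPs
    rand_matrix(3 * NF, NF, rng), rand_matrix(3 * NF, NF, rng),  # stage 2 FCSPs
    p=0.5,
)
print(len(features), len(features[0]))
```

Because each gate multiplies (1 + A) rather than A, a gate value near zero suppresses a frequency point, while larger gate values scale the original component plus a constant offset.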

Deployment of Underwater Acoustic Target Recognition Network and Validation of Model Based Transfer Learning
Deep learning is a data-driven technique in which models are trained on datasets and the trained models are then used to handle specific tasks. Data are scarce for underwater acoustic target recognition, especially in specific application scenarios. Training a randomly initialized model on scarce data limits the model's performance. Initializing the model with pre-trained parameters may improve performance, but the improvement depends on the size and type of the pre-training data and on how transfer learning is carried out. Improper use of a pre-trained model can cause negative transfer, resulting in a decline in performance. In audio pattern recognition, many researchers are exploring the effectiveness of pre-trained models.
Inspired by Large-Scale Pretrained Audio Neural Networks (PANNs) [39], we conducted experiments to verify whether a model pre-trained for audio pattern recognition is applicable to underwater acoustic target recognition. This paper transfers an audio pattern recognition model trained on large-scale data (AudioSet [40]) to an underwater acoustic target recognition task. First, a strong audio pattern recognition model is trained on large-scale audio pattern recognition data. Then, the pre-trained model is transferred to the underwater acoustic target recognition task; specifically, the parameters of the underwater acoustic target recognition model are initialized with the parameters of the audio pattern recognition model. Finally, the transferred model is trained on a specific underwater acoustic signal dataset.
A typical deep neural network for classification is usually composed of two parts. The first part is the backbone network (Backbone) for extracting high-level features, such as a convolutional neural network (CNN), a transformer or a time-delay neural network (TDNN). The other part is the neural network used for classification, of which the fully connected network (FC) is a classic example. To improve the high-level feature extraction performance of the neural network, this paper implements model-based transfer learning on the backbone network. The schematic diagram of the model-based transfer learning strategy is shown in Figure 7.
The Visual Geometry Group (VGG) of Oxford University proposed a backbone network with strong feature extraction performance called VGGNet [41]. This backbone performs well in image recognition, semantic segmentation, speech processing and other fields. Researchers have since applied VGGNet as a feature extractor to large-scale audio event detection tasks and achieved excellent results [40,42]; the resulting model is called VGGish. For the backbone network, this paper designs two backbones based on VGGish. VGGish is composed of stacked convolution kernels, which provide strong feature extraction performance and are easy to implement and extend. Algorithms implemented on VGGish can typically embed other deep learning techniques to further improve performance. Therefore, VGGish is suitable as the backbone for verifying algorithm performance. In this paper, we use a pre-trained model to initialize the parameters of the backbone, and we initialize only the common layers when the layers of different backbones are not completely consistent, which further ensures the versatility of the algorithm. The pre-training method and parameters follow PANNs. Backbone 1 and Backbone 2 are shown in Table 2. In the convolutional layers, C(64,3,1) means that there are 64 convolution kernels, each of size 3 × 3, with a stride of 1. Avg-pooling(1,2) means that the average pooling window is 1 × 2.
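The rule of initializing only the common layers can be sketched as follows. This is an illustrative sketch, not the authors' code: parameters are represented as a plain dictionary mapping a layer name to a (shape, values) pair, a layer counts as "common" when both its name and its shape match, and the layer names and shapes are hypothetical.

```python
def init_common_layers(target, pretrained):
    """Copy pretrained parameters into `target` wherever name and shape both match.

    Both arguments map a layer name to a (shape, values) pair.
    Returns the names of the layers that were initialized from `pretrained`.
    """
    copied = []
    for name, (shape, _) in list(target.items()):
        if name in pretrained and pretrained[name][0] == shape:
            target[name] = (shape, list(pretrained[name][1]))
            copied.append(name)
    return copied


# Hypothetical example: the pretrained backbone is deeper than the target backbone.
pretrained = {
    "conv1": ((2, 2), [0.1, 0.2, 0.3, 0.4]),
    "conv2": ((2, 2), [0.5, 0.6, 0.7, 0.8]),
    "conv3": ((2, 2), [0.9, 1.0, 1.1, 1.2]),
}
target = {
    "conv1": ((2, 2), [0.0, 0.0, 0.0, 0.0]),
    "conv2": ((2, 2), [0.0, 0.0, 0.0, 0.0]),
    "fc": ((1, 5), [0.0] * 5),  # classifier layer: no pretrained counterpart
}
print(init_common_layers(target, pretrained))
```

In PyTorch, a comparable effect is commonly obtained by filtering the pretrained state dictionary to the matching keys and calling `load_state_dict(..., strict=False)`; whether the authors did it this way is not stated in the text.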
Vigo port is one of the largest fishing ports in the world, with heavy fishing-vessel traffic on its waterway, so the target categories collected in this area are diverse. The researchers deployed hydrophones under water and assigned labels according to vessel movement information obtained from the port authority and the Automatic Identification System for vessels. The original recordings were clipped to preserve the information from the beginning to the end of each event or pass-by. ShipsEar is composed of these edited recordings and includes 90 recordings in wav format lasting from 15 s to 10 min. The recordings contain 11 types of ship radiated noise, which can be divided into five categories based on vessel size according to [36], as shown in Table 4. Each recording of ShipsEar contains only one type of vessel. Different recordings may come from different voyages, even if they are of the same vessel type.

Experimental Methods
This section describes the experimental design and experimental details. This paper proposes the FEM-ATNN to improve the accuracy, robustness and generalization of the underwater acoustic target recognition model. In addition, we verify the feasibility of applying a model pre-trained on audio data to underwater acoustic target recognition.
Because of the diversity in vessel speed, environment and navigation state, ship radiated noise differs between voyages. In short, the difference in radiated noise is closely related to the voyage. The difference in radiated noise between the beginning and the end of a long voyage is greater than that within a short period, and the difference between voyages is likely to be greater than that within the same voyage. For an intelligent system, the greater the difference between training and test data, the more generalization ability the model requires.
Generally speaking, there are three ways to divide training data and test data: random segmentation, front/back segmentation, and different-recording segmentation. These three types of division represent three task difficulties, from simple to challenging. Evenly segmenting the recordings and randomly assigning segments to the training and test sets may yield high accuracy, but the task is too easy and cannot properly evaluate an underwater acoustic target recognition algorithm, because training and test segments that are only seconds apart are highly similar. Therefore, this paper designs two classification tasks based on ShipsEar, using front/back segmentation and different-recording segmentation, respectively. It selects 88 recordings for the experiments, and both tasks divide the recordings into five categories according to Table 4.
Task 1 constructs a four-fold dataset by dividing each recording into four equal pieces in chronological order. This division separates training data and test data to a certain extent, which is suitable for evaluating the fitting ability of the neural network model. Task 2 assigns each whole recording to either the training set or the test set in a 3:1 ratio. Task 2 is difficult because the underwater acoustic target recognition algorithm must identify unknown voyages even if the specific target is not present in the training set. Task 2 is therefore more suitable than Task 1 for evaluating the algorithm's generalization and practicability. We use Network 1 in Task 1 and Network 2 in Task 2. All data are downsampled to 16 kHz, and 3626 samples are obtained through division and simple selection, each with a duration of 3 s.
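The two division schemes can be sketched as below. This is a hedged illustration rather than the authors' procedure: the function names are hypothetical, the recording-level split is a plain random 3:1 split (the text does not say whether it is stratified by class), and the 3 s segments are assumed to be non-overlapping.

```python
import random


def chronological_folds(num_samples, n_folds=4):
    """Task 1 style: split one recording into n_folds contiguous (start, end) ranges."""
    bounds = [round(i * num_samples / n_folds) for i in range(n_folds + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n_folds)]


def split_by_recording(recording_ids, test_fraction=0.25, seed=0):
    """Task 2 style: assign whole recordings to train or test in a 3:1 ratio."""
    ids = sorted(recording_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]  # (train, test)


def segment(start, end, seg_len):
    """Non-overlapping fixed-length segments that fit inside [start, end)."""
    return [(s, s + seg_len) for s in range(start, end - seg_len + 1, seg_len)]


FS = 16000          # sampling rate after downsampling, from the text
SEG_LEN = 3 * FS    # 3 s segments, from the text

# A hypothetical 60 s recording: four chronological folds, one held out for testing.
folds = chronological_folds(60 * FS)
test_fold = 0
train_segments = [seg for k, (a, b) in enumerate(folds) if k != test_fold
                  for seg in segment(a, b, SEG_LEN)]
test_segments = segment(*folds[test_fold], SEG_LEN)
print(len(train_segments), len(test_segments))

# 88 recordings split at the recording level.
train_ids, test_ids = split_by_recording(range(88))
print(len(train_ids), len(test_ids))
```

Segmenting each fold separately, as done here, guarantees that no 3 s segment straddles a fold boundary, so training and test segments never share audio.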
Based on the two classification tasks, this paper arranges four groups of comparisons to evaluate the performance of the proposed method. The first experiment compares the FEM-ATNN with multi-resolution STFT on Task 2. The second experiment uses standard STFT as the primary feature to verify the performance of model-based transfer learning on Task 1. The third experiment combines the FEM-ATNN and model-based transfer learning in the same model, evaluates it on Task 1 and Task 2, and compares it with Mel filter bank energy (FBank), Mel frequency cepstral coefficients (MFCC) and linear frequency cepstral coefficients (LFCC). The last group of results compares the proposed method with other recent methods that use ShipsEar.
All the networks are trained by stochastic gradient-based optimization, with Adam as the optimizer. The minibatch size is set to 32, the initial learning rate is set to 0.005, and the learning rate is multiplied by a decay factor of 0.1 every 5000 steps. A total of 15,000 steps are trained. The extraction of FBank, MFCC and LFCC follows torchaudio [43]. FBank is extracted with a window size of 2048 and a hop length of 1024, using 128 Mel filters followed by a logarithmic operation. STFTs also use a logarithmic operation to extract the input features in the training stage. For MFCC, the number of coefficients is set to 40; for LFCC, the number of linear filters is set to 128 and the number of coefficients is set to 40. The convolution kernel size of the TFCK is set to 63 as a low-pass FIR filter with a 1500 Hz cutoff frequency, and its parameters are locked in the training stage. The kernel size and stride of the FEM-ATNN are set to 1024 and 512, respectively. Finally, we use recognition accuracy, recall, precision and F1-score to evaluate the performance of the network.
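The step-decay schedule described above reduces to a one-line formula. The sketch below assumes the decay is applied exactly at multiples of 5000 steps, which the text implies but does not state in detail; the function name is hypothetical.

```python
def learning_rate(step, base_lr=0.005, decay=0.1, decay_every=5000):
    """Learning rate at a given training step under step decay."""
    return base_lr * decay ** (step // decay_every)


TOTAL_STEPS = 15000
BATCH_SIZE = 32

for step in (0, 4999, 5000, 10000, TOTAL_STEPS - 1):
    print(step, learning_rate(step))

# Number of training samples drawn over the whole run (with repetition).
print(TOTAL_STEPS * BATCH_SIZE)
```

With these settings the learning rate takes three values over the run: 0.005, 0.0005 and 0.00005. In PyTorch the same behaviour is typically obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=5000, gamma=0.1)` stepped once per iteration.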

Experimental Results and Discussion
This paper proposes a time-frequency analysis method based on neural networks, whose calculation process and implementation are derived from the original Fourier transform. Therefore, the first experiment compares the proposed method with the short-time Fourier transform (STFT) to assess its advantages. STFTs with different parameters carry different information because they have different frequency and time resolutions, so they can be expected to lead to different final recognition performance. This paper therefore selects STFTs with a series of typical parameters for comparison. Experiment 1 is carried out on the most challenging task, Task 2, to evaluate the generalization of the various methods. STFTs with multiple resolutions are extracted as the primary features and fed to the same backbone network, and their classification performance is compared with that of the model based on the FEM-ATNN. The window length of the STFTs ranges from 512 to 8192, and the hop length is set to half of the window length.
Table 5 shows the results of Experiment 1. STFTs with different resolutions perform differently. Among them, the STFT with a window length of 2048 performs best: its accuracy reaches 78.0%, which is clearly higher than the others, especially those with window lengths of 4096 and 8192. The accuracy of the FEM-ATNN reaches 83.9%, which is 9.1, 8.4, 5.9, 14.8 and 15.9 percentage points higher than the STFTs with window lengths from 512 to 8192, respectively.
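The trade-off between frequency and time resolution behind Experiment 1 can be made concrete with a short calculation. The sketch assumes that the five window lengths are the powers of two from 512 to 8192 (consistent with the five reported accuracy differences), that the signal is sampled at 16 kHz in 3 s segments as stated in the experimental setup, and that frames are counted without padding.

```python
def stft_resolution(window, fs=16000, segment_seconds=3.0):
    """Frequency resolution, hop duration and frame count for one STFT setting."""
    hop = window // 2                       # hop length is half the window length
    n_samples = int(fs * segment_seconds)
    return {
        "freq_resolution_hz": fs / window,  # spacing between frequency bins
        "hop_seconds": hop / fs,            # time step between frames
        "frames": 1 + (n_samples - window) // hop,  # no padding assumed
    }


WINDOWS = [512, 1024, 2048, 4096, 8192]  # assumed set of window lengths
for w in WINDOWS:
    print(w, stft_resolution(w))
```

Short windows give many frames with coarse frequency bins, while long windows give fine frequency bins but only a handful of frames per 3 s segment, which may help explain why the intermediate window length of 2048 performs best among the STFTs.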
Experiment 2 verifies the feasibility of transferring a model pre-trained on large-scale audio pattern recognition data to the underwater acoustic target recognition task. The advantage of transfer learning is that it transfers knowledge from other fields to the current field. For a simple task with little difference between training data and test data, the effect is often not noticeable. Therefore, we not only use tasks in which the training data and test data differ but also evaluate the boundary performance of model-based transfer when the training data and test data are similar. Task 1 provides a variety of combinations of test data and training data, so we conduct the experiment on Task 1 to allow a complete evaluation of the algorithm's performance. Task 1 divides each recording into four folds in chronological order, and the differences between folds are not uniform. In other words, using the head fold as the test set is more challenging than using a middle fold, because test data cut from the middle of a recording are more similar to the surrounding training data. To evaluate the method objectively, we use standard STFT as the primary feature in this experiment to compare the performance of the pre-trained and randomly initialized models. STFT-2048 is input into the backbone network as the primary feature because it performed best in Task 2.
The results of Experiment 2 are shown in Table 6, which compares the performance of the backbone network initialized with pre-trained model parameters and with random parameters. The results show that model-based transfer learning improves performance in most cases. Pre-trained model parameters can be used to initialize the backbone of the underwater acoustic target recognition model even when the backbone is pre-trained on large-scale audio data unrelated to underwater acoustics. It is worth noting that the number of layers of the backbone used for underwater acoustic target recognition differs from that of the pre-trained backbone. We extract the parameters of some layers from the pre-trained model to initialize all the layers of the underwater acoustic target recognition backbone.
Table 8 lists methods from studies in recent years, and our method shows competitive performance. On the one hand, most of the methods in Table 8 segment the recordings and randomly sample adjacent segments as training or test data, which is a lower classification difficulty than Task 1 used in this paper. On the other hand, some methods in the table adopt data augmentation, such as methods No. 7 and No. 8, which both use SpecAugment. The FEM-ATNN-trans does not use data augmentation but still performs well.

Conclusions
In this paper, a time-frequency feature extraction module based on an attention-mechanism neural network is proposed, which combines the operating mechanisms of the convolutional neural network, time-domain filtering and the Fourier transform. The proposed method extracts features directly from the input time-domain signal and can be trained end to end. Classic deep learning methods search for features through neural networks that act as black-box models, which makes it difficult for researchers to analyze the inner workings and behavior of the models. The proposed method can output the frequency-domain response of the neural network in real time. This makes it easier for researchers to understand the learning process of the neural network, which helps to strengthen the network model and improve its generalization. In addition, the feasibility of using non-underwater acoustic data as pre-training data for underwater acoustic target recognition is verified. A series of classification experiments demonstrate the effectiveness of the proposed method, especially for tasks that demand model generalization ability.

Figure 1. A brief description of the method in this paper: first, TFCK is used to filter the raw signal; second, a set of convolution kernels is adopted to extract the time-frequency features from the filtered signal; finally, an attention-based network is applied to the features.

where y(·) and x(·) denote the output and input of the 1D-CK, w(k) denotes the weight of the 1D-CK, N denotes the kernel size of the 1D-CK, s denotes the stride of the convolutional layer, β denotes the bias, and f(·) denotes the activation function. Therefore, an FIR filter can be designed from a single 1D-CK whose bias and activation function are removed. When the stride and kernel size of the convolution are set to 1 and N, the 1D-CK can be treated as an Nth-order FIR filter with a delay of (N − 1)/2.
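The equivalence between a bias-free, activation-free one-dimensional convolution kernel and an FIR filter can be checked numerically. The sketch below is illustrative only: the kernel size of 63, the 1500 Hz cutoff and the 16 kHz sampling rate come from the experimental setup, while the Hamming-windowed sinc design and all function names are assumptions, since the text does not say how the initial coefficients were designed.

```python
import math


def lowpass_kernel(num_taps, cutoff_hz, fs_hz):
    """Hamming-windowed sinc low-pass coefficients, normalized to unit gain at 0 Hz."""
    fc = cutoff_hz / fs_hz                  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2                # centre tap, also the delay in samples
    taps = []
    for k in range(num_taps):
        t = k - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [w / total for w in taps]


def conv1d(x, w, stride=1):
    """1D convolution layer without bias or activation: y[n] = sum_k w[k] * x[s*n + k].

    Like most deep learning layers this is a cross-correlation; for a symmetric
    kernel it gives the same result as a true convolution.
    """
    n_out = (len(x) - len(w)) // stride + 1
    return [sum(w[k] * x[stride * n + k] for k in range(len(w))) for n in range(n_out)]


def gain(w, freq_hz, fs_hz):
    """Magnitude of the kernel's frequency response at freq_hz."""
    omega = 2 * math.pi * freq_hz / fs_hz
    re = sum(wk * math.cos(omega * k) for k, wk in enumerate(w))
    im = -sum(wk * math.sin(omega * k) for k, wk in enumerate(w))
    return math.hypot(re, im)


FS = 16000
kernel = lowpass_kernel(63, 1500, FS)
print("delay in samples:", (len(kernel) - 1) // 2)
for f in (0, 500, 1500, 4000):
    print(f, "Hz ->", round(gain(kernel, f, FS), 4))

# Filtering a 4 kHz tone with the kernel used as a stride-1 convolution.
tone = [math.sin(2 * math.pi * 4000 * n / FS) for n in range(1000)]
filtered = conv1d(tone, kernel)
print(max(abs(v) for v in filtered))
```

With a stride of 1, the kernel passes low-frequency content almost unchanged and strongly attenuates the 4 kHz tone, and the symmetric coefficients give a constant delay of (N − 1)/2 = 31 samples.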

Figure 4 shows the frequency response of TFCK in the classification task of the ShipsEar [36] dataset.

Figure 4. The frequency response of TFCK and spectrograms of a filtered signal at a specific frequency band in the pre-trained stage. (a) The frequency response of TFCK; (b) Initialization; (c) The step of 100; (d) The step of 400; (e) The step of 1000; (f) End of training.

Figure 5. The schematic diagram of the BSTFT-CNN.

Appl. Sci. 2022, 12, x FOR PEER REVIEW

Figure 7. The schematic diagram of the model-based transfer learning strategy.

Table 2. Backbones used in this paper.

Table 3. Classifiers used in this paper.
This paper conducts experiments on ShipsEar, which consists of recordings from different locations along the Spanish Atlantic coast in northwestern Spain made during the autumn of 2012 and the summer of 2013. Most of the data were collected at or near the Port of Vigo (42°14.5′ N, 008°43.4′ W). The Port of Vigo is located within the Vigo River, a submerged river valley that is 35 km long and 10 km wide at its widest point and has a maximum depth of less than 45 m. The recording equipment is the Hyd SR-1 hydroacoustic recorder. The core of this recorder is a hydrophone with a sensitivity of −193.5 dB re 1 V/1 µPa and a frequency response range of 1 Hz to 28 kHz.

Table 5. Recognition results of the FEM-ATNN and the multi-resolution STFT.

Table 6. Recognition results of model-based transfer learning.

Table 8. Comparison between the proposed method and other methods using ShipsEar.