A Robust Feature Extraction Method for Underwater Acoustic Target Recognition Based on Multi-Task Learning

Abstract: Target classification and recognition have always been complex problems in underwater acoustic signal processing because of noise interference and feature instability. In this paper, a robust feature extraction method based on multi-task learning is proposed as an effective solution. Firstly, an MLP-based network model suitable for underwater acoustic signal processing is proposed to optimize feature extraction. Then, multi-task learning with hard parameter sharing is deployed on the model so that it can extract noise-resistant features and embed prior feature extraction knowledge. In the training stage, simultaneous training enables the model to improve the robustness and representation of classification features using the knowledge of different tasks. The optimized classification features are then sent to the classification network to complete target recognition. The proposed method is evaluated on a dataset collected in a real environment. The results show that the proposed method effectively improves recognition accuracy and maintains high performance under different noise levels, outperforming popular methods.


Introduction
Underwater acoustic target recognition based on ship-radiated noise received by a hydrophone is a research hotspot. The signal received by a hydrophone depends on the target characteristics and the marine environment. Features of the signal are closely related to the route state and mechanical working state of the target, which are complex and challenging to describe. The marine environment usually contributes noise at different levels, which weakens the target features and reduces the discriminability of the target. Robust feature extraction is therefore essential for recognition algorithms.
In underwater acoustic target recognition, classical features include time-domain waveform features [1,2], frequency and time-frequency features [3][4][5][6][7][8][9], and auditory perception features [10][11][12][13][14]. However, high-dimensional features are often redundant and difficult to process, whereas low-dimensional features lose information to varying degrees. In addition, each type of feature has a limited range of application scenarios, and features of different dimensions and complexity place different requirements on classifier design. These factors limit the generalization of classification models. With the rapid development of deep learning, it is becoming feasible to perform high-quality feature extraction from high-dimensional, information-rich features and even from raw signals. The strong data-driven learning ability of deep models gives them powerful feature extraction performance, and they have obtained good results in underwater acoustic target localization [15][16][17] and recognition [18][19][20][21][22][23][24][25].
Many researchers have proposed algorithms that improve feature extraction. Qi et al. [26] proposed an integrated neural network based on feature fusion learning for underwater acoustic target recognition. This method extracts the short-time Fourier transform (STFT) amplitude spectrum, STFT phase spectrum, and bispectrum features of underwater acoustic signals to form the network's input. It uses a shuffled frog leaping algorithm (SFLA) to train the weight coefficients of different networks, achieving higher recognition accuracy and stronger noise robustness. Luo et al. [27] used a restricted Boltzmann machine (RBM) to encode, without supervision, the combined power spectrum and demodulation spectrum of ship-radiated noise, extracting the deep data structure layer by layer to obtain the signal feature vector. Tian et al. [24] proposed a multi-scale residual deep neural network (MSRDN) to construct a deep convolutional stack network. It improves on the use of large convolution kernels in the initial stage of the network, avoiding insufficient depth and structural imbalance. MSRDN can directly use the original signal waveform as input and achieves high recognition performance after training. Doan et al. [25] proposed a dense model for underwater target recognition. The model reuses all former feature maps to optimize recognition accuracy under various impaired conditions while keeping computational cost low. Cao et al. [21] proposed a second-order pooling convolutional neural network (CNN) model to capture temporal correlation, improving on the max pooling used in CNNs for underwater acoustic target recognition. Wang et al. [20] proposed a dimension reduction method to obtain multi-dimensional fusion features of the original underwater acoustic signal while keeping the time dimension consistent. Additionally, a Gaussian mixture model (GMM) was used to modify the structure of the deep neural network (DNN) to obtain high accuracy and strong adaptability. Ke et al. [28] proposed a one-dimensional convolutional autoencoder-decoder model to extract features from high-resonance components, together with a supervised feature separation algorithm that further separates the features extracted during pre-training, which increased the recognition rate.
Most methods for improving feature extraction focus on two aspects: (1) improving the primary input features by signal processing or feature fusion; and (2) improving the structure of the classification neural network by model optimization. Although these representative deep learning-based methods have achieved acceptable results in underwater acoustic target recognition tasks, feeding features directly into a black-box classifier for training hides the internal working mechanism of the model and reduces interpretability and performance.
This paper presents a robust feature extraction method (RFEM) that uses multi-task learning to add prior knowledge, improving feature robustness and model performance. The proposed method not only extracts high-level features using the classification task-driven model but also guides the model to extract noise-resistant features and learn prior knowledge. RFEM learns to extract manual features based on prior knowledge while learning to resist noise interference, and uses this information to extract low-dimensional robust features that improve the recognition performance of the model. Specifically, the proposed method designs a multi-layer perceptron (MLP)-based module to suppress noise interference on the signal. Through multi-task learning, it generates a robust feature that integrates time-frequency features, hand-designed features, and task-specific features. The proposed method is efficient, has practical value, and can be combined with other advanced deep learning-based methods after simple modification.
The remainder of this paper is divided into three parts. Section 2 introduces the details of RFEM, including the components of the MLP module, the training method and inference stage of RFEM, and the loss function. Experiments and discussions are described in Section 3, including multiple ablation experiments and recognition experiments under different signal-to-noise ratios. Finally, the conclusion is presented in Section 4.

Proposed Method
Figure 1 shows the details of RFEM with a recognition system as an example, including the training method and inference stage. In short, RFEM utilizes a dedicated feature extraction network to extract robust features. The extracted features are sent to the subsequent classification network for high-level feature extraction and target discrimination. In the training stage, the robust feature extraction network is trained with multi-task learning on an anti-noise task, an a priori knowledge-based feature extraction task, and a classification task. The network finds a feature balanced across the three tasks, making it suitable for signal recognition under different noise levels. Given the complexity of the underwater acoustic target radiated noise signal, this paper designs an MLP model to extract features suitable for different tasks. In addition, various neural networks suitable for underwater acoustic target recognition can be adopted as the classification network. In the training stage, the damaged signal and the original signal are used to train the robust feature extraction network and the classification network. After training, the two networks can be cascaded to construct the recognition system.

As shown in Figure 1, RFEM is introduced with a recognition system as an example. RFEM includes the design of the basic block of the MLP (BBM) module, the design of the robust feature extraction network, and the loss and training method. These parts are described in detail in the following subsections.

Basic Block of the MLP Module
The underwater acoustic channel is a complex time-varying and space-varying channel, which makes various characteristics of ship-radiated noise time-varying. In addition, the generation mechanism of ship-radiated noise is complex, which increases the difficulty of feature extraction. Figure 2 shows the spectrograms of some underwater acoustic targets. Some ship-radiated noise has stable characteristics at specific frequencies but also a certain level of time variation, which is typical of a non-stationary signal. These spectrograms are often used as primary features in deep learning-based target recognition algorithms.

Convolutional neural networks (CNNs) are widely used in image and speech processing. The way a CNN operates gives it an intrinsic advantage in establishing local spatial relations. To capture complex cross-regional relations, a CNN must stack convolutional layers to enlarge the receptive field of the network. For images or speech with strong local spatial relations, CNNs or time-delay neural networks are particularly suitable. However, underwater acoustic target signals, especially ship-radiated noise signals, have no strong local spatial relations. Moreover, much of the available information may be lost locally due to the comb filtering characteristics of the underwater acoustic channel and ocean background noise. Therefore, a neural network that establishes local spatial relations layer by layer is unsuitable for underwater acoustic signal processing. The network needs to establish cross-regional relations to expand the range of the feature search. This paper proposes the BBM to establish a global cross-regional relation in each layer, so that the neural network can extract features that are as stable and reliable as possible, especially for damaged signals. The overall architecture of the proposed BBM is shown in Figure 3.
The underwater acoustic target signal is subjected to a time-frequency transformation to obtain a two-dimensional spectrogram. Its two dimensions represent frequency and time, as shown in Figure 2. The two dimensions of the input or output features of the BBM are defined as the time axis (T-axis) and the frequency axis (F-axis). If the spectrogram is input directly into the BBM, the T-axis represents the time dimension and the F-axis represents the frequency dimension. For the middle layers of the network, the T-axis still denotes the time dimension after feature encoding, and the F-axis denotes the frequency dimension after feature encoding.
BBM divides features into patches along the T-axis and F-axis. This division enables the neural network to build long-distance relationships within a single layer. The division into patches is similar to ViT [29], but BBM patches are divided along the horizontal or vertical axis. The patches divided by the T-axis are called F-patches, and the patches divided by the F-axis are called T-patches. BBM consists of two MLPs [30], Mlp1 and Mlp2. As shown in Figure 3, BBM first sends the F-patches of the input features to Mlp1 for feature extraction, and Mlp1 encodes each F-patch. This encoding enables the neural network to establish its own frequency encoding, so that it can reconstruct the concept of frequency in its own way. Secondly, BBM sends the T-patches of the Mlp1 output to Mlp2. Mlp2 encodes each frequency-encoded value over the whole time span, so the network can readily identify frequency ranges with a significant response and compensate for frequencies masked by noise or other interference over short time intervals. Finally, a residual structure [31] is introduced into the BBM to avoid vanishing gradients, and the features after nonlinear activation are output to the next block. BBM can establish a global feature relation and is suitable for processing ship-radiated noise signals. The robust feature extraction network built on BBM, namely the MLP module, is described next.
In this paper, the MLP module uses an MLP unit to encode the time information, and the encoded features are sent to the BBM sub-module for feature extraction. The BBM sub-module is composed of four BBMs in series, and its output is sent to a mask generator to generate a mask. The mask is multiplied by the input spectrum to filter noise and extract robust features suitable for the classification network. The MLP module thus provides a neural network suitable for processing underwater acoustic signals. The next section introduces the method of training the network.
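To make the data flow of the BBM and the MLP module concrete, the following is a minimal NumPy sketch under stated assumptions, not the implementation used in this paper. The hidden width, the ReLU activation, the sigmoid mask generator, and all function names are hypothetical choices; the initial time-encoding MLP unit is omitted for brevity.

```python
import numpy as np

def init_mlp(rng, dim, hidden):
    """Random weights for a two-layer perceptron mapping dim -> hidden -> dim."""
    return (rng.normal(0.0, 0.1, (dim, hidden)), np.zeros(hidden),
            rng.normal(0.0, 0.1, (hidden, dim)), np.zeros(dim))

def mlp(x, w1, b1, w2, b2):
    """Apply the perceptron along the last axis of x (ReLU is an assumption)."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

def bbm_forward(x, mlp1, mlp2):
    """One basic block; x has shape (T, F) = (time frames, frequency bins)."""
    # Mlp1 encodes each F-patch: one time frame across all frequencies.
    y = mlp(x, *mlp1)
    # Mlp2 encodes each T-patch: one encoded frequency across all time frames.
    y = mlp(y.T, *mlp2).T
    # Residual connection followed by a nonlinear activation.
    return np.maximum(x + y, 0.0)

def mlp_module(spec, blocks, mask_w, mask_b):
    """Stack the blocks, generate a mask, and apply it to the input spectrum."""
    h = spec
    for mlp1, mlp2 in blocks:
        h = bbm_forward(h, mlp1, mlp2)
    # Hypothetical mask generator: a linear layer with a sigmoid output.
    mask = 1.0 / (1.0 + np.exp(-(h @ mask_w + mask_b)))
    return mask * spec

rng = np.random.default_rng(0)
T, F = 16, 32
blocks = [(init_mlp(rng, F, 64), init_mlp(rng, T, 64)) for _ in range(4)]
spec = np.abs(rng.normal(size=(T, F)))  # stand-in for a magnitude spectrogram
features = mlp_module(spec, blocks, rng.normal(0.0, 0.1, (F, F)), np.zeros(F))
```

Because each perceptron spans an entire axis, every output value can depend on all frequencies and all time frames after a single block, which is the global cross-regional relation the text describes.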

Training Method of RFEM-Based Recognition Systems
Large training sets are not available for most intelligent underwater acoustic target recognition models, and it is difficult for recognition algorithms to extract reliable features from noisy signals. In addition, many scholars have studied ship-radiated noise and proposed stable classical manual feature extraction methods, such as the frequency-selection method. Therefore, this paper proposes a multi-task learning method to train RFEM, so that RFEM learns to resist noise interference and learns frequency-selection knowledge to complete the extraction of manual features. The schematic diagram of the multi-task strategy is shown in Figure 5.
For the MLP module, the single mask generator is replaced with three mask generators during the training stage. These three generators are used for different tasks: a robust feature extraction task for classification, an anti-noise task, and an optimized frequency-selection task. All three tasks use their mask generators to generate masks, and then use the masks to suppress interference on the original spectrum and extract useful information. The three tasks are similar, so they reinforce each other.
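The three-generator arrangement can be illustrated with a short sketch of hard parameter sharing: one shared trunk feeds three task-specific mask heads, and each mask is applied to the same input spectrum. The single-layer trunk, the sigmoid heads, and the head names are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multitask_masks(spec, trunk, heads):
    """Return one masked spectrum per task from a shared representation."""
    w, b = trunk
    shared = np.tanh(spec @ w + b)  # parameters shared by all tasks
    # Each head has its own parameters but reads the same shared features.
    return {name: sigmoid(shared @ hw + hb) * spec
            for name, (hw, hb) in heads.items()}

rng = np.random.default_rng(0)
T, F, H = 8, 16, 32
trunk = (rng.normal(0.0, 0.1, (F, H)), np.zeros(H))
task_names = ("classification", "anti_noise", "frequency_selection")
heads = {name: (rng.normal(0.0, 0.1, (H, F)), np.zeros(F)) for name in task_names}
spec = np.abs(rng.normal(size=(T, F)))
outputs = multitask_masks(spec, trunk, heads)
```

Gradients from all three heads flow into the same trunk parameters, which is how the auxiliary tasks can shape the features used for classification.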

Anti-Noise Task
The purpose of the anti-noise task is to make the model learn to resist different levels of noise and avoid interference in classification. Gaussian white noise with a specified power is added to the original signal to form a damaged signal. The anti-noise ability of RFEM is trained on an anti-noise task based on these samples, as shown in Figure 6. The anti-noise task is called Task 1 in this paper.

The purpose of Task 1 is to optimize the damaged signal and extract features similar to those of the original signal. Gaussian white noise is added to the original signal according to Equation (1):

$$d = s + n \tag{1}$$

where $n$ is Gaussian white noise, $s$ is the original signal, and $d$ denotes the damaged signal. The energy of the added noise is calculated according to Equation (2):

$$E(n) = \frac{E(s)}{10^{\mathrm{SNR}/10}} \tag{2}$$

where SNR represents the signal-to-noise ratio in dB, $E(s)$ is the energy of the original signal, and $E(n)$ is the energy of the noise signal. It is worth noting that the signal-to-noise ratio here is not the actual signal-to-noise ratio of the ship-radiated noise, because the recorded signal already contains ambient noise. The original and damaged signals are sent to the primary feature extractor to extract the primary features. The original signal is optionally autocorrelated before being fed into the primary feature extractor.
The time-frequency feature used in this paper is the spectrum. After primary feature extraction, the features of the damaged signal are input into RFEM, which outputs optimized features with the same dimensions as the input features. The model is trained to optimize the damaged signal so that the extracted features match those of the original signal.
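The construction of a damaged signal at a prescribed signal-to-noise ratio can be sketched as follows. The sketch assumes the standard decibel relation SNR = 10 log10(E(s)/E(n)) with energy taken as the sum of squared samples; the function name and the test tone are hypothetical.

```python
import numpy as np

def add_white_noise(s, snr_db, rng):
    """Return (d, n): the damaged signal d = s + n and the scaled noise n."""
    energy_s = np.sum(s ** 2)
    # Noise energy implied by the target SNR in dB.
    energy_n = energy_s / (10.0 ** (snr_db / 10.0))
    n = rng.standard_normal(s.shape)
    # Rescale the noise draw so that its energy matches energy_n exactly.
    n *= np.sqrt(energy_n / np.sum(n ** 2))
    return s + n, n

rng = np.random.default_rng(1)
s = np.sin(np.linspace(0.0, 20.0, 3000))  # stand-in for a recorded segment
d, n = add_white_noise(s, snr_db=5.0, rng=rng)
measured_snr = 10.0 * np.log10(np.sum(s ** 2) / np.sum(n ** 2))
```

Rescaling the sampled noise, rather than only setting its variance, makes the realized SNR of each sample equal to the requested value instead of matching it only on average.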

Optimized Frequency-Selection Task
Ships are equipped with many complex types of machinery, including power systems and other auxiliary mechanical systems. These machines inevitably produce friction, collision, and vibration when they operate or when the ship moves, and the resulting noise radiates into the ocean. This noise is likely to contain specific frequency components that are useful for target recognition [8].

This paper introduces an optimized frequency-selection method to select important frequency components. The corresponding frequency-selection task, designed for the training stage, is called Task 2. Time-frequency analysis of a ship-radiated noise signal reveals how the frequency components change with time. From this information, the intensity of each frequency component is calculated according to Equation (3), where $S$ denotes the optimized time-frequency spectrum matrix, $T$ represents the length of the time dimension, and $F$ is the frequency intensity vector. The frequency components with high intensity are screened according to Equation (4), where $\bar{F}$ denotes the mean value of the frequency intensity vector, $f$ represents the maximum frequency value, $a$ represents the frequency threshold, and $A$ represents a frequency selection vector. After obtaining $A$, a second frequency selection vector is calculated according to Equation (5), where $k_c(\cdot)$ denotes the $c$-th kernel average smoother, $N$ represents the total number of kernel average smoothers, and $B$ represents the second frequency selection vector. After obtaining the two frequency selection vectors, the frequency intensity vector is optimized according to Equation (6), where $L$ denotes the optimized frequency intensity vector. The optimized frequency-selection task is designed according to the optimized frequency intensity vector, as shown in Figure 7.
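The ingredients of the frequency-selection procedure can be illustrated with a rough sketch. It is not a reproduction of Equations (3) to (6): the time average used as the intensity, the use of the threshold as a multiplier of the mean intensity, and the box-kernel smoother are all assumptions chosen only to show the kind of computation involved.

```python
import numpy as np

def frequency_intensity(S):
    """Time-averaged intensity per frequency bin; S has shape (T, bins)."""
    return S.mean(axis=0)

def select_frequencies(intensity, a=1.0):
    """Binary selection vector: 1 where intensity exceeds a times the mean."""
    return (intensity > a * intensity.mean()).astype(float)

def kernel_average_smoother(v, width):
    """Box-kernel average over `width` neighbouring bins (zero-padded edges)."""
    kernel = np.ones(width) / width
    return np.convolve(v, kernel, mode="same")

# Toy spectrum: flat background with one strong line at bin 3.
S = np.ones((10, 9))
S[:, 3] = 5.0
intensity = frequency_intensity(S)
A = select_frequencies(intensity, a=1.0)
B = kernel_average_smoother(A, width=3)
```

In this toy case only the line at bin 3 is selected, and smoothing spreads the selection to the adjacent bins, which tolerates small frequency drift of a line component.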

Training Strategy and Loss Function
The previous sections introduced the anti-noise and optimized frequency-selection tasks in detail. This section describes how these tasks are used to extract robust features and how the loss functions are designed. This paper presents a multi-task learning method that uses the anti-noise task and the optimized frequency-selection task as auxiliary tasks and the classification task as the main task. The same optimized normalization approach is used to handle features in the different tasks. The three tasks are trained collaboratively under the hard parameter-sharing mechanism to search automatically for common features. Driven by the training samples, the model is guided to extract robust features and complete target classification, as shown in Figure 8.
In the training stage, the damaged signal is first generated from the original signal, and the time-frequency analysis module is then used to extract the time-frequency features. Finally, the time-frequency features of the damaged signal and the original signal are sent to RFEM to complete the three tasks. In this paper, Task 1 and Task 2 are trained with the L1 loss, and Task 3, the classification task, is trained with the negative log-likelihood loss. Although each task has its own loss, the three tasks are trained simultaneously according to Equation (7). Simultaneous training and the hard parameter-sharing mechanism enable RFEM to extract robust features suitable for classification tasks.
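A joint objective of this kind can be sketched as below. The unweighted sum of the three losses is an assumption, since Equation (7) may weight the terms differently; the function names and the four-class example are likewise hypothetical.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error, used here for Task 1 and Task 2."""
    return np.mean(np.abs(pred - target))

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D vector of class scores."""
    z = logits - logits.max()
    return z - np.log(np.sum(np.exp(z)))

def nll_loss(log_probs, label):
    """Negative log-likelihood of the true class, used here for Task 3."""
    return -log_probs[label]

def joint_loss(denoised, clean, selected, freq_target, logits, label):
    """Assumed joint objective: plain sum of the three task losses."""
    task1 = l1_loss(denoised, clean)        # anti-noise task
    task2 = l1_loss(selected, freq_target)  # frequency-selection task
    task3 = nll_loss(log_softmax(logits), label)
    return task1 + task2 + task3

clean = np.ones((4, 6))
freq_target = np.zeros((4, 6))
loss = joint_loss(clean, clean, freq_target, freq_target, np.zeros(4), label=0)
```

With perfect outputs on the two auxiliary tasks and uniform class scores over four classes, the joint loss reduces to the classification term, log 4.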

Experimental Dataset
This paper conducted experiments using ShipsEar [32], which was developed by a research group of the University of Vigo on the Spanish Atlantic coast and is currently widely used by researchers. The dataset was recorded by hydrophones deployed from docks to capture different ship noises corresponding to docking or undocking maneuvers. The autonomous acoustic digitalHyd SR-1 recorder was used to record the data. The recorder had a nominal sensitivity of −193.5 dB re 1 V/1 μPa and a flat response in the 1 Hz to 28 kHz frequency range. The annotations include the recording technology and the environmental and other conditions during collection. The dataset consists of 90 recordings in WAV format. The recordings belong to five categories: ocean noise and four different types of ship targets, as shown in Table 1. Each category contains one or more targets, and the duration of each recording ranges from 15 s to 10 min. The data were pre-processed by removing blank signals and segmenting all records into fixed-duration segments of 3 s, which resulted in 3626 labeled sound samples. Two dataset partitioning methods [33] were used in the experiments. The first method randomly shuffled all samples and split them into a training set and a test set at a ratio of 4:1; this partition was named Dataset A. The second method took only the four types of target samples and sorted the samples from the same record by time. The earlier samples were used as test samples and the later samples as training samples, with a ratio of training data to test data of 3:1; this partition was named Dataset B. Dataset B is more challenging than Dataset A in recognition tasks and closer to practical applications, because it is equivalent to training the model on samples from one period of time and predicting the category of targets in another.
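The two partitioning schemes can be sketched as follows. This is an illustrative outline only: the function names, the `records` dictionary layout (record ID mapped to a time-ordered list of segments), and the rounding of the split point are assumptions, not details given in the paper or in [33].

```python
import random

def split_random(samples, train_fraction=0.8, seed=0):
    """Dataset A style: shuffle all samples, then split 4:1."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def split_by_time(records, test_fraction=0.25):
    """Dataset B style: within each record, the earliest segments go to
    the test set and the later segments go to the training set (3:1)."""
    train, test = [], []
    for record_id, segments in records.items():
        cut = int(len(segments) * test_fraction)  # assumed rounding rule
        test.extend((record_id, seg) for seg in segments[:cut])
        train.extend((record_id, seg) for seg in segments[cut:])
    return train, test

# Hypothetical usage: two records with 8 and 4 time-ordered segments.
records = {"rec_a": list(range(8)), "rec_b": list(range(4))}
train_b, test_b = split_by_time(records)
```

With the time-ordered split, training and test segments from the same record never interleave in time, which is why Dataset B is the harder and more realistic setting.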

Experimental Setup
In this paper, three experiments were designed to verify the proposed method: a basic experiment, a feature comparison experiment with observed data, and a comparison experiment with published algorithms. The basic experiment compared the performance of the proposed method with that of the features extracted directly by the primary feature extractor. The primary features used here were classic short-time Fourier transform (STFT) spectral features. The second experiment compared the proposed method with popular feature-based methods [34], including MFCC, F-Bank, and CQT. The final experiment compared the proposed method with published methods. The first two experiments were based on the challenging Dataset B to evaluate the proposed method comprehensively, whereas the last experiment used Dataset A to be consistent with the comparison methods. MFCC and F-Bank used a window size of 2048 and a hop length of 512, with 40 and 128 frequency bands, respectively. In the experiments, different levels of Gaussian white noise were added to the observed data to simulate signals with different signal-to-noise ratios (SNRs) and to evaluate the robustness of the method to noise. In the first two experiments, the simulated SNR ranged from −5 dB to 30 dB to give a more comprehensive picture of algorithm performance. The final experiment simulated SNR levels from −10 dB to 5 dB to maintain consistency with the comparison methods.
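Adding Gaussian white noise at a prescribed SNR follows from the definition SNR(dB) = 10 log10(P_signal / P_noise). The sketch below is one common way to do this; the function name and the use of the mean squared amplitude as signal power are assumptions, since the paper does not describe its noise-generation procedure in detail.

```python
import math
import random

def add_white_noise(signal, snr_db, seed=0):
    """Return a copy of `signal` with Gaussian white noise at `snr_db`.

    Signal power is taken as the mean squared amplitude (an assumption).
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # Solve SNR(dB) = 10*log10(P_signal / P_noise) for P_noise.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]

# Hypothetical usage: a 50 Hz tone sampled at 4 kHz, degraded to -5 dB SNR.
tone = [math.sin(2 * math.pi * 50 * n / 4000) for n in range(4000)]
noisy = add_white_noise(tone, snr_db=-5.0)
```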
The classic VGGish [35] was used as the classification network to construct the classifier. VGGish is widely used, so it can objectively reflect the classification gains brought about by feature improvement. All raw audio recordings were resampled to 4 kHz. In the training stage, the Adam algorithm was used as the optimizer with its default parameters. The model was trained for a total of 50,000 steps with an initial learning rate of 0.0001. When the number of training steps reached 30,000, the learning rate was reduced to one-tenth of its initial value.
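The learning-rate schedule described above amounts to a single step decay. A minimal sketch, with a hypothetical function name and with the assumption that the reduced rate applies from step 30,000 onward:

```python
def learning_rate(step, base_lr=1e-4, decay_step=30_000, decay_factor=0.1):
    """Step decay: base_lr before `decay_step`, base_lr * decay_factor after."""
    return base_lr if step < decay_step else base_lr * decay_factor

# Rates at a few points of the 50,000-step run.
schedule = [learning_rate(s) for s in (0, 29_999, 30_000, 49_999)]
```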

Basic Experiment
The basic experiment compared the performance of the proposed method with that of the features directly extracted by the primary feature extractor. In the experiment, the short-time Fourier spectrum extractor was used as the primary feature extractor. In other words, this part directly compared the robust features extracted by RFEM with the classical time-frequency features. During the experiment, the classification networks were configured with the same parameters, and the following experiments also followed this rule. Precision, recall, F1-score, and accuracy were used as evaluation indicators. The experimental results are shown in Table 2. Table 2 shows that the proposed method is ahead of STFT on all evaluation indicators and improves the accuracy by 5.1%, which indicates that the proposed method can improve recognition performance. To further verify the robustness of the proposed method, recognition experiments with different Gaussian noise levels were performed on the dataset. In the experiment, Gaussian white noise with specified power was added to the data, and the results are shown in Figure 9.
Figure 9 shows that the proposed method significantly improves the anti-noise performance of the recognition model. Using the robust features generated by RFEM for classification, the accuracy remains almost unchanged under slight noise interference. Even at −5 dB, the accuracy is reduced by no more than 5%. In contrast, the classification model adapted to STFT features has weak anti-noise performance. The two methods are based on the same primary feature extractor and classification network, but their performance is very different. The RFEM and multi-task learning strategy of the proposed method effectively improve feature extraction and classification performance.
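The four evaluation indicators used in the basic experiment can be computed from true and predicted labels as sketched below. The paper does not state how per-class precision, recall, and F1-score are averaged, so the macro average used here is an assumption, as are the function name and the returned dictionary keys.

```python
def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,  # macro average (assumed)
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Hypothetical usage with two classes.
scores = evaluate([0, 0, 1, 1], [0, 1, 1, 1])
```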

Comparison of Popular Feature-Based Methods
In this experiment, the proposed method was compared with methods based on popular features, including MFCC, F-Bank, and CQT. These features are widely used in deep learning-based recognition methods. Experiments were conducted at different noise levels, and the results are shown in Figure 10. Figure 10 shows that, without additional noise, the proposed method has better classification performance than the classical features. Under different noise levels, the features behave differently. CQT performs well at high SNR, but it does not resist noise well. MFCC and F-Bank have similar anti-noise performance, with F-Bank slightly ahead of MFCC. The proposed method is ahead of all of these classical features.

Figure 1. The illustration of an RFEM-based recognition system.

Figure 3. Schematic diagram of basic block of the MLP module.
The underwater acoustic target signal is subjected to time-frequency transformation to obtain a two-dimensional spectrogram. These two dimensions represent frequency and time, respectively, as shown in Figure 2. The two dimensions of the input or output features of the BBM are defined as the time-axis (T-axis) and the frequency-axis (F-axis). If the spectrogram is directly input into the BBM, the time dimension is represented by the T-axis, and the F-axis represents the frequency dimension. For the middle layer of the network,

This section introduces the MLP module designed on the basis of the BBM constructed above, called the robust feature extraction network. The robust feature extraction network aims to address three defects of classical deep learning-based underwater acoustic target recognition algorithms: (1) the classic recognition network is trained on a spectrogram, and the recognition accuracy is limited by insufficient feature extraction performance; (2) the anti-noise ability of classical features is insufficient; and (3) manual features extracted from human prior knowledge are difficult to fuse with the features extracted automatically by the machine. The MLP module for extracting robust features is shown in Figure 4.

Figure 4. Schematic diagram of the MLP module.

Figure 7. Schematic diagram of optimized frequency-selection task.
Based on the powerful feature extraction performance of RFEM, Task 2 guides the model to learn to select the frequencies suitable for classification. It is worth noting that when the model is trained on Task 2, the frequency is still filtered based on the mask structure in Figure 5.

Figure 8. The training strategy of multi-task learning.

Figure 9. Experimental results with different noise levels.

Table 1. The type of noise contained in the dataset used in the experiments.

Table 2. Classification accuracy of the basic experiment.