A Novel Underwater Acoustic Target Recognition Method Based on MFCC and RACNN

In ocean remote sensing missions, recognizing underwater acoustic targets is a crucial technology for marine biological surveys, ocean exploration, and other scientific activities that take place in water. The complex acoustic propagation characteristics of the ocean present significant challenges for underwater acoustic target recognition (UATR). Methods such as extracting the DEMON spectrum of a signal and feeding it into an artificial neural network, or fusing the multidimensional features of a signal for recognition, have been proposed. However, there is still room for improvement in noise immunity, computational performance, and reduced reliance on specialized knowledge. In this article, we propose the Residual Attentional Convolutional Neural Network (RACNN), a convolutional neural network that quickly and accurately recognizes the type of ship-radiated noise. The network extracts internal features from the Mel Frequency Cepstral Coefficients (MFCC) of underwater ship-radiated noise. Experimental results demonstrate that the proposed model achieves an overall accuracy of 99.34% on the ShipsEar dataset, surpassing conventional recognition methods and other deep learning models.


Introduction
Underwater acoustic target recognition (UATR) has long been one of the main areas of research in underwater technology. It is commonly applied in marine biological surveys, ocean exploration, and underwater scientific activities [1,2]. However, the mechanism by which underwater targets generate radiated noise is very complex, and the radiated noise contains multiple components, including continuous-spectrum components and strong discrete-spectrum components. Furthermore, many factors, such as spatiotemporal variations in the underwater acoustic channel, multipath effects, and Doppler effects, affect the propagation of underwater acoustic signals [3][4][5]. As a result, UATR is a highly challenging technology.
The recognition of underwater target-radiated noise can be divided into two steps: feature extraction and recognition algorithms. Scholars have spent decades on manually extracting features from underwater acoustic targets [6][7][8]; the methods include the short-time Fourier transform (STFT) [9], low-frequency analysis and recording (LOFAR) [10], the Mel-frequency spectrum [6], detection of envelope modulation on noise (DEMON) [11], and Mel Frequency Cepstral Coefficients (MFCC) [12]. Traditional algorithms, such as GMM [13] and SVM [14], have been used in the underwater acoustic field. These manually extracted features and algorithms have played a significant role; for example, MFCC features were extracted in a substantial body of work [15][16][17] for UATR. Xin et al. note that traditional classifiers have many limitations in the classification of ship-radiated noise signals: in the ocean, complex environmental noise, the low SNR of underwater acoustic signals, and interference from the radiated noise of other ships make reliable recognition difficult for traditional classifiers.

(2) Compared to other networks, we reduce the number of parameters and effectively highlight crucial information hidden in the time-frequency spectrum. This reduces the use of computational resources and increases computational efficiency and speed, which matters in practical applications.
The structure of this article is as follows: Section 1 introduces the research background of underwater acoustic technology and related techniques. Section 2 describes the characterization and extraction of raw acoustic signals and the design of the deep neural network. Section 3 gives the experimental results and analysis. Section 4 summarizes the paper and outlines future technological progress.

Proposed Method
This section consists of two parts. Part A introduces the MFCC feature extraction process, detailing how the MFCC features are methodically extracted. Part B introduces the network structure of the proposed RACNN, giving a detailed account of its structural components, layers, and mechanisms.

MFCC Feature Extraction
Directly recognizing radiated noise from underwater targets faces the problems of noise interference, spectral time variability, and frequency attenuation. To overcome these problems, the Mel Frequency Cepstral Coefficient (MFCC) is used as the feature representation. The MFCC has the following advantages: (1) Suppression of noise interference: the MFCC filters and normalizes the spectrum during feature extraction, which attenuates the interference of noise on the target-radiated noise to a certain extent. (2) Resistance to temporal variability: the MFCC uses the Discrete Cosine Transform (DCT) to compress the spectrum, which captures the main features of the spectrum and reduces the effect of time variability. (3) Reduced feature dimension: the MFCC converts the spectrum into cepstral coefficients, which effectively reduces the dimension of the features and makes the subsequent pattern recognition and classification tasks more efficient.
MFCC is a widely used feature in the field of speech and voice recognition, first introduced by Davis and Mermelstein in 1980 [25]. The MFCC feature is an efficient tool for describing and analyzing ship-radiated noise: it simulates the characteristics of human auditory perception, compresses spectral information, reduces dimensionality, and facilitates accurate recognition [26]. The conversion between linear frequency and Mel frequency is as follows:

Mel(f) = 2595 × log10(1 + f / 700)

where Mel(f) denotes the Mel frequency and f denotes the Fourier frequency.

The extraction of MFCC entails a series of steps. Initially, framing and windowing are applied to the time-domain signal, segmenting it into discrete frames. Fourier transformation is then applied to each frame, producing a local spectrum. A pivotal stage in the MFCC extraction process is the mapping of data from the Fourier domain onto the Mel scale. This is achieved with a bank of triangular Mel-scale filters, imparting a characteristic spectral shape that mirrors the non-linear human auditory system:

H_m(k) = 0                                    for k < f(m−1) or k > f(m+1)
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1))       for f(m−1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m))       for f(m) ≤ k ≤ f(m+1)

where H_m(k) represents the frequency response of the m-th Mel-scale filter, k denotes the frequency index of the Discrete Fourier Transform (DFT), and f(m) represents the Fourier frequency corresponding to the m-th Mel filter. Each filter is multiplied with the power spectrum of the signal and summed, yielding a set of logarithmic energy values:

s(m) = ln( Σ_k |X_a(k)|² H_m(k) )

where s(m) represents the m-th logarithmic energy value and |X_a(k)|² represents the power spectrum of the signal, obtained by DFT. Finally, to complete the extraction of MFCC, the Discrete Cosine Transform (DCT) is applied to s(m). The above extraction process is shown in Figure 1.

This paper performs MFCC feature extraction on experimental samples of duration 1 s. The FFT frame length of the data subframes is 2048 points, and neighboring frames overlap by 1536 points. The number of Mel filters is set to 50, and a total of 44 frames of MFCC data are obtained, which are concatenated in chronological order into a two-dimensional array, shown in Figure 2, that serves as one sample for underwater target recognition. Each sample is flattened before being input to the network, giving a data length of 2000.
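The pipeline above (framing, windowing, FFT, Mel filtering, logarithm, DCT) can be sketched in NumPy. This is a minimal illustration, not the paper's exact implementation: the window type, sample rate, number of retained cepstral coefficients, and the absence of frame padding are assumptions here, so the frame count differs slightly from the 44 frames reported in the text.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters H_m(k) spaced uniformly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr, frame_len=2048, hop=512, n_filters=50, n_ceps=13):
    # Frame the signal (hop 512 = 2048 - 1536 overlap) and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    win = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    # Power spectrum |X_a(k)|^2 of each frame.
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2
    # Mel filter-bank energies, then log -> s(m).
    fb = mel_filterbank(n_filters, frame_len, sr)
    s = np.log(power @ fb.T + 1e-10)
    # DCT-II across the filter axis decorrelates s(m) into cepstral coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return s @ dct.T

# Example: 1 s of a 440 Hz tone at an assumed 22.05 kHz sample rate gives a
# (40, 13) feature matrix with these settings (no padding, so slightly fewer
# frames than the 44 the paper reports for its configuration).
sr = 22050
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 440.0 * t), sr)
```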

Design of Deep Learning Networks
The input data of the proposed RACNN model consist of the MFCC features of the radiated noise from underwater targets. These coefficients are flattened along the dimension of the cepstral coefficients to form a one-dimensional sequence (i.e., 1D-MFCC). The design of the RACNN model was informed by the architectural principles of both ResNet and attention networks. The proposed RACNN architecture is shown in Figure 3. The model comprises four types of components: three Block_A modules, two Block_B modules, a Fully Connected (FC) layer, and a Softmax layer. Each element contributes uniquely to the network's ability to process complex patterns within the input data.

The core functionality of the RACNN unfolds as follows: the network takes a 1D-MFCC sequence as input, a representation of the acoustic features derived from underwater signals. Through the interplay of the defined modules and layers, the network outputs the category with the maximum predicted probability.

In deep learning networks, the convolutional layers extract features from the input data. Each convolutional layer consists of multiple convolutional kernels, where each kernel corresponds to a set of weight coefficients and a scalar bias. Given an input sequence x and the weight vector w of a convolutional kernel of length K, the i-th element of the output y is defined as follows:

y[i] = Σ_{j=0}^{K−1} w[j] · x[i + j] + b

where w[j] represents the j-th element of the weight vector w, x[i + j] represents the corresponding element of the input sequence x, and b is the bias.
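As a quick check of this definition, a "valid" 1-D convolution (strictly a cross-correlation, matching the index convention y[i] = Σ_j w[j] · x[i + j] + b used here) can be written directly:

```python
import numpy as np

def conv1d_valid(x, w, b):
    """y[i] = sum_j w[j] * x[i + j] + b  ('valid' mode: no padding)."""
    k = len(w)
    return np.array([np.dot(w, x[i:i + k]) + b for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])   # a simple difference (edge-detecting) kernel
y = conv1d_valid(x, w, b=0.0)    # -> [-2., -2., -2.]
```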
In Block_A in Figure 3, the first layer is Conv_1, a convolutional layer with a 1 × 3 kernel, a 1 × 1 stride, and the Elu activation function:

Elu(x) = x                for x > 0
Elu(x) = α(e^x − 1)       for x ≤ 0

Conv_1 serves two primary purposes: mapping the low-dimensional input sequence to a high-dimensional representation and extracting abstract features of the input signal. The number of channels in the Conv_1 layer of the first Block_A is set to 256, and the number of channels in the Conv_1 layer of each subsequent Block_A is half that of the previous one. The output of Conv_1 is passed to a Batch Normalization (BN) layer, which normalizes and standardizes the data; this helps accelerate the convergence of the network. The following layer is a Max Pooling (MP) layer, which downsamples the data (in the latter two Block_A modules, the MP layer is replaced by an Average Pooling (AP) layer). After the MP layer, the data pass through two parallel channels: the upper channel applies a convolutional layer Conv_2 with a 1 × 1 kernel, while the lower channel applies a convolutional layer Conv_3 with a 1 × 3 kernel. Convolutional kernels of different sizes capture representative features at different scales. The outputs of the two channels are concatenated along the channel axis, and the result is added to the MP output to form a residual structure.
Let the input vector of Block_A be p ∈ R^{1×1×W}, the output of Conv_1 be P′ ∈ R^{C×1×W}, and the output of the MP layer be P″. The vectors P′ and P″ and the output of Block_A can be described as follows:

P′ = Elu(f_n^{1×3}(p))
P″ = Maxpool_4(P′)
Output = P″ + Concat(f^{1×1}(P″), f^{1×3}(P″))

where Elu represents the activation function, f_n^{1×3} represents a convolutional layer with n kernels of size 1 × 3, and Maxpool_4 represents a pooling layer of size 4.
The initial layer of Block_B in Figure 3 is a convolutional layer with a 1 × 3 kernel, followed by a BN layer. Its output is fed into the SE_Block, which automatically learns the significance of the feature maps in the various channels, producing a weight matrix that is multiplied with the feature maps. Channels with a higher weight are more important and correlate more strongly with classification. SE_Block thus implements the attention mechanism. The implementation of the SE mechanism is shown in Figure 3, where N is set to 1/16 of the number of input feature-map channels and C equals the number of input feature-map channels. When data pass through the SE module, each channel of the input feature map is multiplied by a weight factor, which enhances the channels with representative feature maps and reduces the importance of unimportant channels, in turn reducing the network's attention to them. The output of the SE_Block is then fed through a convolutional layer and added to the output of the first convolutional layer of Block_B, which helps prevent overfitting and further fits the signal features.
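A minimal NumPy sketch of the channel-weighting step described above, assuming the standard squeeze-and-excitation formulation (global average pooling, a reduction FC with ratio 16, ReLU, an expansion FC, and a sigmoid). The weights here are random placeholders, not trained values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a (C, W) feature map.
    w1: (C//r, C) reduction weights; w2: (C, C//r) expansion weights."""
    s = x.mean(axis=1)            # squeeze: global average pool -> (C,)
    z = np.maximum(w1 @ s, 0.0)   # excitation: FC + ReLU -> (C//r,)
    a = sigmoid(w2 @ z)           # FC + sigmoid -> per-channel weights in (0, 1)
    return x * a[:, None]         # rescale each channel by its weight

rng = np.random.default_rng(0)
C, W, r = 32, 100, 16             # r = 16 reduction ratio, as in the paper
x = rng.standard_normal((C, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out = se_block(x, w1, w2)
```

Because each per-channel weight lies strictly in (0, 1), the SE block can only attenuate channels, never amplify them; the network learns which channels to preserve nearly unchanged and which to suppress.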
The subsequent module after Block_B is an FC layer. Its input is the flattened output of the convolutional layers, its output is a 1 × 256 vector, and its activation function is Relu. It realizes end-to-end learning by unfolding the multi-channel output of the convolutions and mapping it through a weight matrix into a one-dimensional vector. The concluding layer of the network is the Softmax layer, which maps the outputs of multiple neurons into probabilities between 0 and 1. This layer normalizes the output data generated by the neurons and selects the output with the highest probability as the final prediction. The Softmax function is expressed as follows:

S(z_i) = e^{z_i} / Σ_{j=1}^{n} e^{z_j}

where S(z_i) represents the probability generated by the i-th neuron after passing through the Softmax layer, n denotes the total number of output neurons, and z_i represents the value of the i-th neuron before normalization.
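The Softmax normalization can be sketched as follows; the max-subtraction trick is a standard numerical-stability detail, not something the paper specifies:

```python
import numpy as np

def softmax(z):
    z = z - z.max()        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()     # probabilities sum to 1

logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5])  # raw scores for 5 classes
probs = softmax(logits)
pred = int(np.argmax(probs))                    # class with maximum probability
```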

Experiment Setup and Dataset
In this study, the dataset for training and validation is derived from the ShipsEar [27] dataset. This dataset comprises sound recordings of various vessels, capturing a spectrum of ship-radiated noise recorded along the Atlantic coast of Spain during 2012 and 2013. It contains 91 samples covering 11 distinct types of ship-radiated noise in addition to natural environmental sounds: fishing boats, trawlers, mussel boats, tugboats, dredgers, motorboats, pilot boats, sailboats, passenger ships, ocean liners, and ro-ro ships. The dataset also includes a dedicated background-noise category, providing a comprehensive representation of the acoustic environment along the Atlantic coast during that period. The variety of noise sources makes the dataset rich and relevant to the study's objectives, enabling a robust evaluation of the proposed RACNN model for underwater target recognition.
The 11 types of ship-radiated noise were grouped into 4 categories, and background noise was designated as a separate category, giving five categories in total, as follows: Class A: Natural Ambient Noise. During training, the RACNN model used the Adam optimizer with a learning rate of 0.001, a batch size of 128, and a training period of 60 epochs. A learning-rate decay strategy was applied to RACNN, reducing the learning rate by a factor of 0.2 every 20 epochs.
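If the decay rule is read as a multiplicative step schedule (an assumption — the text's "decreasing the learning rate to 0.2 every 20 epochs" most plausibly means multiplying by a factor of 0.2), the learning rate over the 60 training epochs can be sketched as:

```python
def lr_at(epoch, base_lr=0.001, factor=0.2, step=20):
    """Step decay: multiply the learning rate by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)

# Over 60 epochs this gives three plateaus: 1e-3, 2e-4, 4e-5.
schedule = [lr_at(e) for e in range(60)]
```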

Experiment Results and Discussion
In the RACNN model, Block_A incorporates multi-channel and residual-connection mechanisms to enhance feature extraction, improve the convergence rate, and avoid gradient vanishing and explosion. Block_B combines residual connections with an attention mechanism to prioritize the information most critical for classification and recognition. The numbers of Block_A and Block_B modules, as well as the number of FC-layer units, can all affect the performance of the RACNN. To evaluate the impact of each block and determine the optimal parameters for the RACNN model, we built a series of networks with different configurations. The parameters and performance of six typical networks, Model_1 to Model_6, are presented in Table 2 and Figure 5. Testing on the ShipsEar dataset shows that the best validation accuracy reaches 0.9934, with 149 K network parameters and 0.139 GFLOPs of computation. When the number of FC units is 0, Model_1 has the lowest recognition accuracy and Model_5 the highest. This indicates that more Block_A and Block_B modules allow more representative features to be extracted and enhance the network's ability to recognize underwater targets. Notably, Model_5 achieves 1.6% higher accuracy than Model_3, at the cost of an additional 20 K parameters. Moreover, an FC layer can significantly increase recognition accuracy: for Model_4, an FC layer with 256 units yielded a 2.9% improvement over Model_3. It should be noted that increasing the units in Block_A, Block_B, and the FC layer does not always improve performance; once the network parameters exceed a certain scale, performance may decrease. This may be because, excluding the first Block_A, the number of feature-map channels is halved in each subsequent block, so some important information is lost after multiple operations. As an illustration, adding a 64-unit FC layer to Model_6 decreased recognition accuracy by 1.2% compared to Model_5 without an FC layer. In the following experiments, we choose Model_4, which has the best performance, as the RACNN model.
In deep learning, the size of the convolutional kernel and the choice of activation and cost functions have a great impact on the overall performance of the network. Activation functions are essential for artificial neural network models to learn deep features of the signal and to capture very complex nonlinear mappings, while the convolutional kernel size affects the network's ability to capture information and its number of parameters. In Table 3, we compare several classical activation functions, i.e., the Relu, Sigmoid, and Elu functions, as well as the recognition accuracy of the network with different convolutional kernel sizes. The data in the table show that, for a fixed convolution kernel size (i.e., a fixed number of network parameters), the network achieves the best recognition result of 0.9934 with the Elu activation function, which is 1.42% better than the Sigmoid function and 0.31% better than the Relu function. The Sigmoid function saturates when the input takes a very large positive or negative value: the function becomes flat and insensitive to small changes in the input. The Relu function, unlike Elu, has a zero derivative in the negative region, so no gradient is propagated there; this phenomenon, known as "dying Relu", can leave the corresponding parameters never updated. The convolutional kernel size directly affects the number of network parameters, which in turn affects the computation rate. Experimental comparison shows that with a kernel size of 3, the network has both the smallest number of parameters and the highest recognition accuracy. With kernel sizes of 5 or 7, the network may fail to capture small-scale features, losing important information and reducing accuracy.
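The saturation and dying-Relu behaviors discussed above can be checked numerically with a small sketch (α = 1 for Elu is an assumption; the paper does not state its value):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
# Relu zeroes every negative input (zero gradient there -> "dying Relu"),
# while Elu keeps a small negative output bounded below by -alpha, so its
# gradient stays nonzero. Sigmoid saturates near 1 for large inputs, so its
# derivative sigmoid(x) * (1 - sigmoid(x)) becomes vanishingly small.
r, e, s = relu(x), elu(x), sigmoid(10.0)
```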
The training process of the best RACNN model is shown in Figure 6, which showcases the model's rapid convergence. The recognition result on the underwater-target test dataset is shown in Figure 7. Each column of the confusion matrix represents the predicted label category, and each row represents the true category of the sample.
In addition, we use precision, recall, and F1-score as performance metrics to characterize the performance of the method. Each metric is calculated as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-score = 2 × Precision × Recall / (Precision + Recall)

where TP denotes True Positives, FP denotes False Positives, and FN denotes False Negatives. These metrics can be obtained from the confusion matrix and are presented in Table 4. The experiment resulted in a lowest recall of 0.9918, a lowest precision of 99.36%, and a lowest F1-score of 0.9418. The average recall, precision, and F1-score were 0.9944, 0.9945, and 0.9945, respectively. The experimental results reveal that the proposed RACNN model can recognize underwater acoustic targets with good accuracy.

To further evaluate the performance of the proposed RACNN model, we applied classical deep learning classification networks, such as Vgg16 [28] and ResNet34 [29], to recognize underwater targets and compared their performance. After these networks were modified to accept underwater target data as input, they were trained from randomly initialized parameters. The experimental results are shown in Table 5, and the training process is shown in Figure 8. RACNN began to converge after approximately 10 iterations owing to its residual and attention structures, whereas the other networks required around 20 iterations to stabilize. Compared with Vgg16 and ResNet34, the RACNN achieved the highest accuracy of 0.9934, which is 1.04% and 0.22% better than the Vgg16 and ResNet34 networks, respectively. Additionally, all the networks achieve a recognition accuracy of over 0.98 on the dataset, indicating the effectiveness of the MFCC feature extraction method. The RACNN also has the fewest parameters: Vgg16 and ResNet34 contain 33 M and 21 M parameters, respectively. This means the RACNN consumes fewer resources and less computing power, achieving faster calculation speed.
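The per-class metrics can be computed directly from a confusion matrix laid out as in Figure 7 (rows = true class, columns = predicted class). The matrix below is a toy example, not the paper's data:

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall, and F1 per class from a confusion matrix
    with rows = true class and columns = predicted class."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as class c but actually another class
    fn = cm.sum(axis=1) - tp   # actually class c but predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 3-class confusion matrix (illustrative numbers only).
cm = [[50, 2, 0],
      [1, 45, 4],
      [0, 3, 47]]
p, r, f = per_class_metrics(cm)
```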
We also performed a comprehensive comparison of our RACNN network with other Underwater Acoustic Target Recognition (UATR) methodologies proposed in the field. The findings are summarized in Table 6, giving a clear overview of the performance metrics across different classification methods. The combination of MFCC and RACNN emerges as the standout performer, achieving the highest accuracy with a notably compact parameter size as well as the fewest Floating Point Operations (FLOPs). These results underscore the RACNN network's ability to learn distinctive features from underwater acoustic signals in a remarkably efficient manner: the feature extraction method and the structure of the network together allow it to discern and exploit unique characteristics within the acoustic data. The results affirm its superior performance compared to existing methodologies and underscore its potential for real-world applications where efficiency and accuracy are paramount.

Conclusions
Traditional methods for identifying underwater target-radiated noise have typically relied on limited features, making them less effective in complex sea conditions. This paper introduced the RACNN model, utilizing MFCC data as input for the recognition of underwater ship-radiated noise. The proposed network was trained and assessed on the ShipsEar dataset. Integrating residual and attention mechanisms, the network combines the strengths of ResNet and attention models. By leveraging the CNN convolutional module, the RACNN model facilitates self-learning of target features, mitigating issues such as gradient dispersion and vanishing. Simultaneously, the network effectively extracts valuable information, discarding redundant data and prioritizing channels with representative features, thereby enhancing accuracy. Experimental results demonstrate that RACNN achieves an impressive recognition accuracy of 99.34%. This surpasses the accuracy and time efficiency of models such as VGG and ResNet. Furthermore, RACNN proves adept at extracting high-level information conducive to classification, showcasing significant potential for underwater target recognition.
This model outperforms other UATR models in both accuracy and parameter scale.In light of these findings, future work will explore avenues for refining and extending the RACNN model, considering additional datasets and further optimizing its performance in diverse underwater scenarios.This ongoing research aims to contribute to the continuous advancement of underwater acoustic target recognition methodologies.
The RACNN model presents a robust and efficient solution for underwater target recognition, demonstrating superior performance and holding promise for applications in various maritime environments.

Figure 1. The extraction process of MFCC.


Figure 2. A typical sample of underwater target MFCC features.


Figure 3. Structure of the proposed RACNN model.
Class B: Fishboat, Trawler, Mussel Boat, Tugboat, Dredger. Class C: Motorboat, Pilot Ship, Sailboat. Class D: Passenger Ship. Class E: Ocean Liner, Ro-Ro. The FFT spectrum, the Mel-filter output, and the DCT transformation of the five types of noise are shown in Figure 4.

Figure 4. Results of FFT, Mel filter bank, and DCT of different radiated noise. (a) Natural noise in Class A. (b) Tugboat noise in Class B. (c) Motorboat noise in Class C. (d) Passenger noise in Class D. (e) Ocean liner noise in Class E.

The 91 audio recordings from the ShipsEar dataset were split into 1 s intervals to generate MFCC features of the underwater targets as experimental samples. The samples were then randomly split into a training set and a testing set in a 0.8:0.2 ratio: the training set consists of 9040 samples and the testing set of 2260 samples. As shown in Table 1, of the 9040 training samples, 912 belong to category A, 1500 to category B, 1248 to category C, 3416 to category D, and 1964 to category E. The 2260 test samples contain 288 from class A, 375 from class B, 312 from class C, 854 from class D, and 491 from class E.


Figure 5. Experimental process for different RACNN models.

Figure 6. The training process of the proposed RACNN model.


Figure 7. Confusion matrix of RACNN on the test dataset.


Figure 8. Training process of different deep learning models.

Table 1. Dataset for training and testing.

Table 2. Experimental results for different RACNN models.


Table 3. Models with different kernel sizes.

Table 4. Performance of the RACNN model on the test set.

Table 5. Experimental results of different networks.


Table 6. Experimental results of different feature models.