1. Introduction
Underwater acoustic target recognition is a key area in marine acoustics [1]. It automatically identifies different target classes by analyzing the radiated noise of underwater targets. This technology is applied in various areas, including underwater monitoring, protection, and the enhancement of security and defense [2,3,4]. In recent years, deep learning has become the main technology used in underwater acoustic recognition systems.
However, underwater acoustic signals in the actual marine environment face numerous challenges, as illustrated in Figure 1. Traditional underwater acoustic detection methods rely predominantly on near-field datasets, as shown on the left side of the figure. Although data-driven acoustic recognition models based on deep learning demonstrate promising performance on public datasets, the complex time-varying characteristics of the ocean channel pose significant problems [5,6]. During propagation, sound waves are influenced by factors such as seawater temperature, salinity, depth, and ocean topography. These factors give rise to phenomena such as signal attenuation, scattering, and multi-path propagation, resulting in blurred and distorted received underwater acoustic signals.
In view of this, we have employed the ocean acoustic channel simulation method based on the theory of ocean acoustic propagation to generate the far-field dataset of the ocean channel. This dataset comprehensively takes into account complex factors such as multi-path propagation, effectively overcoming the limitations of traditional near-field datasets.
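At its simplest, this kind of channel simulation amounts to convolving a near-field signal with a sparse multipath impulse response. The sketch below illustrates the idea only; the delays, amplitudes, and sampling rate are hypothetical stand-ins for the eigenray outputs a ray-theory model would actually produce:

```python
import numpy as np

def apply_multipath_channel(signal, delays_s, amplitudes, fs=8000):
    """Convolve a near-field signal with a sparse multipath impulse
    response built from (hypothetical) eigenray delays and amplitudes."""
    ir_len = int(max(delays_s) * fs) + 1
    h = np.zeros(ir_len)
    for d, a in zip(delays_s, amplitudes):
        h[int(round(d * fs))] += a          # one tap per arrival path
    # Linear convolution models y(t) = h(t) * x(t)
    return np.convolve(signal, h)

rng = np.random.default_rng(0)
x = rng.standard_normal(8000)               # 1 s of surrogate radiated noise
y = apply_multipath_channel(x, delays_s=[0.0, 0.012, 0.031],
                            amplitudes=[1.0, 0.6, 0.35])
print(y.shape)
```

A full simulation would additionally model frequency-dependent attenuation and ambient noise; this sketch captures only the multi-path structure the paragraph describes.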
Most of the existing underwater acoustic recognition literature relies only on class labels as supervision. Recent progress in machine learning provides opportunities to exploit information beyond class labels, among which multi-task learning (MTL) is an attractive option [7].
In this study, an MTL framework is used, with "estimating the relative position between the target and the sonar" as the localization task, enabling the recognition task to perceive robust patterns in underwater signals related to the target-sonar geometry. The labels for the localization task are assigned when the dataset is created from the sound field model. The localization task also prompts the model to learn target-related acoustic features, such as comb filtering and interference fringes, thereby deepening the model's understanding of the signal.
Inspired by the mixture-of-experts (MoE) model and the multi-gate mechanism [8,9], an improved multi-task framework, MEG (multi-task, multi-expert, multi-gate) [7], is adopted to fully exploit the potential of MTL. Specifically, MEG replaces the traditional output layer with multiple independent network layers (expert layers). The expert layers share the same architecture but have different parameters, enabling them to specialize in different aspects and provide fine-grained knowledge through independent parameter spaces. In addition, MEG uses multiple gating layers to learn task-specific weights dynamically, so that each task linearly combines the outputs of the expert layers with its own weights to obtain a task-specific representation. A top-k gating mechanism [10,11] dynamically selects the top k expert layers according to their importance to the task, improving the model's efficiency and performance.
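The multi-gate, top-k idea can be sketched in a few lines. In MEG the gates and experts are learned neural layers; here the gate logits and expert outputs are random placeholders, so this is an illustration of the combination rule only:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def topk_gate(gate_logits, expert_outputs, k=2):
    """Keep only the k highest-scoring experts for this task and
    renormalize their gate weights; other experts contribute nothing."""
    idx = np.argsort(gate_logits)[-k:]        # indices of the top-k experts
    w = softmax(gate_logits[idx])             # renormalized weights
    return sum(wi * expert_outputs[i] for wi, i in zip(w, idx))

rng = np.random.default_rng(1)
num_experts, dim = 4, 8
experts = [rng.standard_normal(dim) for _ in range(num_experts)]
# One gate per task (multi-gate): classification and localization
gates = {"cls": rng.standard_normal(num_experts),
         "loc": rng.standard_normal(num_experts)}
reprs = {task: topk_gate(g, experts, k=2) for task, g in gates.items()}
print(reprs["cls"].shape)
```

Each task thus obtains its own linear combination of expert outputs, while top-k selection keeps only the most relevant experts active per task.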
To verify the performance of the multi-task framework and the MEG model, a data augmentation method based on ray theory was used to generate a synthetic dataset of ship-radiated noise in the direct-arrival zone and the shadow zone at a sea depth of 3500 m (the DS3500 dataset). A series of experiments were then conducted on this dataset. The results show that the multi-task model achieves high recognition accuracy and also performs well in the localization task; on DS3500, MEG reaches an accuracy of 95.93% on the five-class recognition task. The main contributions of this study are as follows:
A ship-radiated noise data augmentation model based on ray theory is proposed for simulating ship-radiated noise signals received by sonar in the ocean channel. This model helps to address the issue of limited underwater acoustic data by generating additional training samples.
An identification framework named MEG is designed. By enabling the model to learn the relative position between the target and the sonar, its ability to capture robust patterns is enhanced.
The top-k gating mechanism is introduced to dynamically screen the top k expert layers according to the importance of the task, improving the model’s efficiency and performance.
A large number of experiments were conducted to verify the superiority of the MEG model in localization and recognition. The model achieves excellent performance on the DS3500 dataset.
4. Results
4.1. Training and Test
We used the features extracted by the Short-Time Fourier Transform (STFT) as input and trained the MEG network on the DS3500 dataset for 150 epochs. During training, we continuously recorded key quantities such as the training loss and classification accuracy, and we visualized these data to observe the training dynamics and performance of the model more clearly.
Figure 9 depicts the loss curve, recognition accuracy, and normalized range and depth localization errors throughout the training process. The loss curve declines steadily during training, demonstrating that the network is learning effectively and making progress.
Simultaneously, the recognition accuracy steadily increases. This indicates that the network is successfully capturing relevant features, resulting in improved performance on the recognition task. Concurrently, both the range and depth localization errors gradually decrease. This improvement shows that the network can accurately estimate positions, signifying its effectiveness in the localization task.
Upon completion of the training process, the network performance was evaluated using the test set.
Figure 10 illustrates the receiver operating characteristic (ROC) curves and the confusion matrix. Figure 10a shows the ROC curves with the area under the curve (AUC) for each class, reflecting outstanding discriminative ability for every class. Figure 10b presents the confusion matrix, with true labels on the vertical axis and predicted labels on the horizontal axis, clearly depicting the classification performance for each class.
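Per-class discriminative ability of the kind summarized by an ROC curve can be quantified as a one-vs-rest AUC. A minimal NumPy sketch using the Mann-Whitney formulation follows; the scores and labels are toy values, not the paper's data:

```python
import numpy as np

def auc_ovr(scores, labels, cls):
    """One-vs-rest AUC for one class via the Mann-Whitney U statistic:
    the probability that a positive sample outranks a negative one."""
    pos = scores[labels == cls]
    neg = scores[labels != cls]
    # Count pairwise wins; ties count half
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))

labels = np.array([0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9])   # model score for class 1
print(auc_ovr(scores, labels, cls=1))            # 5/6 ≈ 0.833
```

An AUC of 1.0 corresponds to perfect ranking of that class against all others, which is what near-unity per-class AUC values in such plots indicate.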
Figure 11 and Figure 12 present the per-class accuracies for the five classes after training the MEG network, along with the range and depth localization results on the first 36 samples. The bar chart shows high accuracies for all classes (each above 0.9): the 4th class achieves the highest accuracy, 100%, while the 1st class has the lowest, 91.67%. The range and depth plots show that the predicted values (red dashed lines) closely track the true values (blue solid lines), indicating that the localization network estimates range and depth accurately.
4.2. Comparison Experiments of Different Features
To explore the impact of different feature parameters on the performance of the MCL network (the multi-task classification and localization network, i.e., MEG without the MoE module), comparative experiments were designed and conducted with five feature types: Mel, STFT, MFCC, GFCC, and CQT. The corresponding parameter settings for these features are shown in Table 4.
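As a concrete example of such a feature, a log-magnitude STFT with an FFT size of 1024 yields 1024/2 + 1 = 513 frequency bins, consistent with the STFT channel dimension mentioned in Section 4.4. A minimal NumPy sketch on a surrogate signal (the frame and hop sizes are illustrative, not the paper's exact settings):

```python
import numpy as np

def stft_feature(x, n_fft=1024, hop=512):
    """Log-magnitude STFT: frame the signal, apply a Hann window,
    and take the one-sided FFT -> n_fft//2 + 1 = 513 frequency bins."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))    # (frames, 513)
    return np.log1p(spec).T                       # (513, frames)

rng = np.random.default_rng(2)
x = rng.standard_normal(16000)                    # surrogate noise segment
feat = stft_feature(x)
print(feat.shape)
```

Mel, MFCC, GFCC, and CQT features are computed from similar time-frequency decompositions with different filterbanks or transforms, which is why their channel dimensions (and hence downstream network costs) differ.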
The experimental results are summarized in Table 5 and Figure 13. The data show that the choice of feature exerts a substantial influence on the MCL network's performance. Among the evaluated features, STFT is the top performer overall in this experiment, attaining an accuracy (ACC) of 96.07%, a mean absolute error for range localization (MAE-R) of only 0.26 km, and a mean absolute error for depth localization (MAE-D) of 27.68 m. This implies that the STFT feature proficiently extracts the information relevant to both tasks, enabling the network to achieve strong results in classification and localization alike.
The GFCC feature ranks second overall, with an ACC of 96.10%, an MAE-R of 0.34 km, and an MAE-D of 35.42 m. Although its classification accuracy is marginally higher than that of STFT, the STFT feature delivers clearly better range and depth localization.
The Mel feature shows an ACC of 94.63%, an MAE-R of 0.52 km, and an MAE-D of 60.61 m. Although it lags slightly behind the STFT and GFCC features in localization metrics, it still sustains a relatively high level of recognition performance, suggesting its potential utility in the MCL network.
In contrast, the MFCC and CQT features perform relatively poorly. The MFCC feature has an ACC of 89.49%, an MAE-R of 1.27 km, and an MAE-D of 131.78 m. The CQT feature records an ACC of 94.39%, an MAE-R of 1.82 km, and an MAE-D of 210.76 m. It is likely that these two features fail to fully capture the task-relevant information during extraction, which degrades the network's performance to a certain extent.
4.3. Comparison Experiments of Different Networks
To compare the performance of the MEG network with that of other networks, a series of comparative experiments were carried out. Nine representative network models were selected: MEG (STFT), MCL (STFT), MEG-C (STFT), MEG-L (STFT), DenseNet121 [67], ResNet18 [68,69], MobileNetV2 [70], ResNet50 [69], and Swin-Transformer [71]. Here, "network (STFT)" denotes that STFT is employed as the input feature of the network. MEG-C is the classification-task branch network of MEG, and MEG-L is the localization-task branch network of MEG. These two branch networks are used to compare the results between multi-task and single-task scenarios.
DenseNet121, ResNet18, MobileNetV2, ResNet50, and Swin-Transformer are all general-purpose architectures and are employed for the classification task. Given that the input feature dimension of the Short-Time Fourier Transform (STFT) is relatively large, a convolutional layer with a stride of 2 is added at the beginning of these general-purpose network frameworks to reduce the image dimension. These models vary in network depth, structural design, and parameter scale, enabling a comprehensive evaluation of their suitability for multi-task learning and single-task scenarios.
The experimental results are summarized in Table 6 and Figure 14. As the data indicate, the network architecture significantly impacts both classification accuracy and localization precision, with distinct performance patterns observed between multi-task and single-task networks.
Among the models under test, the MEG (STFT) network performs best overall, achieving a classification accuracy of 95.93% together with reliable localization: an MAE-R of 0.2011 km and an MAE-D of 20.61 m. This reflects its ability to effectively integrate classification and localization in multi-task learning. The MCL (STFT) network has a slightly higher classification accuracy of 96.07%, but its localization is noticeably weaker than that of MEG, with an MAE-R of 0.2565 km and an MAE-D of 27.68 m. Although MCL (STFT) shows a small advantage in classification, there remains room for optimization in balancing localization accuracy.
In comparison to networks optimized for joint localization and classification, single-task architectures designed exclusively for either localization or classification exhibit distinct performance trade-offs. Specifically, while single-task networks (e.g., MEG-C (STFT) for classification and MEG-L (STFT) for localization) achieve respectable performance—with a classification accuracy of 95.76% and localization metrics of MAE-R (mean absolute error for range) 0.2013 km and MAE-D (mean absolute error for depth) 20.79 m, respectively—the multi-task MEG network outperforms them in both effectiveness and efficiency. Notably, MEG delivers superior performance with only a single training process, demonstrating that multi-task architectures like MEG not only yield better results but also reduce computational overhead through the cross-task integration of feature extraction.
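The section does not state the exact loss weighting used, but a joint objective of the form "cross-entropy for recognition plus weighted MAE terms for range and depth" is a common way to train such a multi-task network in one pass. A hedged NumPy sketch, with an illustrative weight lambda and toy values:

```python
import numpy as np

def multitask_loss(logits, cls_label, range_pred, range_true,
                   depth_pred, depth_true, lam=1.0):
    """Joint objective: cross-entropy for the recognition task plus
    MAE terms for range and depth, weighted by a (hypothetical) lambda."""
    p = np.exp(logits - logits.max())      # numerically stable softmax
    p /= p.sum()
    ce = -np.log(p[cls_label])             # cross-entropy term
    mae_r = abs(range_pred - range_true)   # range localization error
    mae_d = abs(depth_pred - depth_true)   # depth localization error
    return ce + lam * (mae_r + mae_d)

loss = multitask_loss(np.array([2.0, 0.5, -1.0, 0.1, 0.0]), cls_label=0,
                      range_pred=10.2, range_true=10.0,
                      depth_pred=0.05, depth_true=0.04, lam=1.0)
print(round(loss, 4))
```

Training both tasks through one such objective is what allows a single optimization run to replace the two separate runs that the single-task branches require.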
General architectures such as DenseNet121, ResNet18, MobileNetV2, ResNet50, and Swin-Transformer mainly focus on the classification task. Compared with MEG, their performance shows a significant gap. DenseNet121 has a classification accuracy of 86.61%, ResNet18 84.99%, MobileNetV2 83.60%, ResNet50 76.34%, and Swin-Transformer only 63.08%. These models struggle to achieve performance comparable to MEG, which is perhaps because they fail to extract useful features for classification as effectively as MEG does.
In summary, experimental results underscore that architectural design must align with task requirements. Multi-task networks outperform single-task models in overall performance, while single-task counterparts remain competitive in their specific domains.
4.4. Analysis of Network Convergence
To further understand the impact of different network architectures on the convergence process, we analyze the convergence epochs and training costs of several representative networks. The networks under consideration are MEG, DenseNet121, ResNet18, MobileNetV2, ResNet50, and Swin-Transformer. To ensure the fairness of the comparison, this study restricts its analysis to the recognition task alone.
As shown in the figures, the convergence behavior differs markedly across networks. MEG (STFT) requires 68 epochs to converge (convergence is defined as the first epoch at which an accuracy of 80% is achieved); in contrast, MEG (GFCC) takes only 2 epochs. This improvement can be attributed to the lower channel dimension of the GFCC features (GFCC: 200, STFT: 513), which significantly reduces both the network size and the computational complexity of the convolution operations within the MEG architecture.
ResNet18 converges in 20 epochs. Its relatively simple architecture gives it an advantage in convergence speed, but when combined with the validation accuracy curve, there may be a trade-off in terms of precision. DenseNet121 converges in 31 epochs, MobileNetV2 in 44 epochs, and ResNet50 in 46 epochs. The different convergence epochs of these networks reflect the differences in the efficiency of feature learning and parameter optimization of their respective architectures.
During the training process, the validation accuracy of Swin-Transformer stays below 80% with no convergence trend (hence no convergence epoch is shown). This suggests poor adaptability between its architecture and the current task: the model struggles to improve in accuracy, and its cost-effectiveness is low.
In terms of training time, as shown in Figure 17, MEG (GFCC) takes only 10 min, a clear advantage; combined with its very fast convergence, this reflects high training efficiency when GFCC is used as the input. MEG (STFT) takes 22 min; although it needs more epochs to converge, the time cost remains acceptable. ResNet18 trains in 17 min, which matches its relatively fast convergence and reflects the efficiency of a simple architecture. DenseNet121 requires 43 min, MobileNetV2 21 min, ResNet50 32 min, and Swin-Transformer as long as 54 min. Swin-Transformer thus combines low accuracy, no sign of convergence, and the longest training time; considering both training efficiency and effectiveness, it is not practical for the current task. Overall, the input features and network architecture jointly determine the number of convergence epochs, the training duration, and the final precision.
The dual advantages of MEG (GFCC) in convergence efficiency and training time provide a useful reference for network design and optimization. They also clarify the differences in task adaptability among networks, aiding the targeted selection or improvement of networks in subsequent work.
4.5. k-Fold Cross-Validation Results Analysis
To evaluate the generalization ability and robustness of different feature inputs combined with network architectures, we conducted k-fold cross-validation experiments. The performance metrics are the classification accuracy (ACC) and its standard deviation (std), MAE-R and its std, and MAE-D and its std. The results for the MEG and MCL networks with STFT and GFCC features are summarized in Table 8 and Figure 18.
The analysis shows that different models exhibit varying strengths in classification and localization. MCL (GFCC) achieves the highest classification accuracy at 96.10%, but its localization performance is relatively poor. MCL (STFT) outperforms MCL (GFCC) in localization, with only a marginal deficit in classification. MEG (GFCC) demonstrates the best localization performance, with MAE-R and MAE-D as low as 0.1707 km and 19.43 m, respectively, though its classification accuracy is slightly lower. MEG (STFT) offers the most balanced performance, with a classification accuracy of 95.93% and localization errors of 0.2011 km and 20.61 m.
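The k-fold protocol behind these per-fold means and standard deviations can be sketched with plain index splitting; the fold count and the dummy per-fold score below are illustrative, not the paper's pipeline:

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle once, then yield (train_idx, val_idx) for each of k folds."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, val

accs = []
for train, val in kfold_indices(100, k=5):
    # ... fit on `train`, evaluate on `val`; a dummy score stands in here
    accs.append(len(val) / 100)
print(np.mean(accs), np.std(accs))
```

Reporting the mean and std across folds, as in Table 8, separates a model's average performance from its sensitivity to the particular train/validation split.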
5. Discussion
The relationship between the localization and recognition tasks resembles recognizing an object under diverse weather conditions [64]: in the underwater context, the ocean acoustic channel distorts the recognition features much as weather affects visual features.
The signal received by the sonar can be precisely described by the following equations. In the time domain, it is expressed as

$$y(t) = h(t) * x(t) + n(t),$$

where $y(t)$ denotes the received signal, $h(t)$ represents the ocean channel function, $x(t)$ is the target signal, $n(t)$ stands for the noise, and $*$ symbolizes convolution.

In the frequency domain, the equation is

$$Y(f) = H(f) \cdot X(f) + N(f).$$

Here, $Y(f)$ is the spectrum of the received signal, $H(f)$ is the frequency-domain representation of the ocean channel function, $X(f)$ is the spectrum of the target, $N(f)$ is the spectrum of the noise, and $\cdot$ indicates multiplication.

From these equations, it is clear that $H(f)$ serves for target localization, while $X(f)$ is utilized for target recognition. Typically, $H(f)$ and $X(f)$ exhibit no correlation.
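The time/frequency duality in these equations can be checked numerically: zero-padding both sequences to the full linear-convolution length makes DFT-domain multiplication coincide with time-domain convolution. A minimal NumPy check with surrogate signals:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(64)                   # target signal x(t)
h = rng.standard_normal(16)                   # channel impulse response h(t)
n = 0.01 * rng.standard_normal(64 + 16 - 1)   # additive noise n(t)

# Time domain: y(t) = h(t) * x(t) + n(t)
y_time = np.convolve(h, x) + n

# Frequency domain: Y(f) = H(f) X(f) + N(f); zero-padding to the full
# linear-convolution length L makes the DFT's circular convolution linear
L = len(x) + len(h) - 1
Y = np.fft.rfft(h, L) * np.fft.rfft(x, L) + np.fft.rfft(n, L)
y_freq = np.fft.irfft(Y, L)

print(np.allclose(y_time, y_freq))  # True
```

This also makes the decoupling concrete: the channel term $H(f)$ and the target term $X(f)$ enter the product independently, which is why the two tasks draw on different parts of the received spectrum.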
Moreover, a conventional multi-task network aims to share network features to enhance multi-task learning effectiveness. However, in the ocean channel scenario, the features of the ocean channel and those of the target are uncorrelated. To explore this further, two shared one-dimensional convolutional layers were added to the front-end of the network. The recognition results are as follows.
Table 9 clearly shows that the MEG network with shared layers has a significantly lower recognition rate and larger localization errors, indicating its suboptimal performance in both localization and recognition tasks. Compared with other configurations, the shared-layer design proves relatively ineffective in enabling efficient feature extraction and task integration for these two tasks, failing to fully realize the original intention of the multi-task design.
Nevertheless, the concept of using a multi-task network still holds promise. Although the features learned by the two tasks may seem unrelated, this approach allows us to perform both localization and recognition within a single network, making the training process more lightweight and resource-efficient. Moreover, we regard this as a foundation for future research. There may be hidden connections between the two tasks that, once discovered, could enhance overall performance.