To assess the effectiveness of DART-MT in real underwater settings, we employ widely used underwater acoustic datasets to enable comprehensive qualitative and quantitative analyses, including comparisons of DART-MT with current models for underwater acoustic target recognition.
In this section, we present an experimental evaluation of the proposed solution, conducted on three publicly available benchmark datasets with diverse characteristics. We demonstrate the effectiveness of the proposed method and examine the contribution of each module within DART-MT to gain a deeper understanding of its operational mechanisms. Additionally, we performed several in-depth analyses to address the following research questions (RQs):
RQ1: How does the performance of DART-MT compare to that of SOTA methods in real underwater settings?
RQ2: What is the contribution of DART-MT's key modules to its performance improvement?
RQ3: What further insights do the in-depth studies, covering pre-trained weights, limited labels, robustness, loss functions, complexity, generalizability, and cross-environment adaptability, provide about DART-MT?
RQ4: How do hyper-parameter settings, in particular the training label ratio, affect DART-MT's performance?
Most of the comparisons in this study were made with models such as ResNet18, ViT, DenseNet121, and EfficientNetB0 for several reasons. First, these models are widely used in deep learning for various tasks, including image and audio recognition; their popularity indicates that they have demonstrated good general-purpose feature extraction and classification performance. Second, they represent different types of neural network architectures: ResNet18 is a classic residual network that effectively addresses the vanishing-gradient problem, allowing deeper networks to be trained; ViT is a Transformer-based model that excels at handling global context information; DenseNet121 has a dense connection structure that promotes feature reuse and efficient learning; and EfficientNetB0 is designed to balance model complexity and performance, achieving high accuracy with relatively few parameters and computational resources. By comparing DART-MT with these models, we can comprehensively evaluate its performance from multiple aspects, including feature-extraction ability, model complexity, and generalization. This comparison helps to clearly position the DART-MT model within the existing research landscape and to highlight its advantages and potential areas for improvement.
4.5. Ablation Study (RQ2)
4.5.1. Feature Ablation Experiment
The mean teacher approach is inapplicable to fully labeled data (100% labels), so feature ablation experiments were solely conducted on the DART model. Prior research has validated DART’s effectiveness via its unique architecture and innovative feature-extraction mechanisms, demonstrating its capability to handle complex data patterns and extract discriminative features essential for accurate underwater acoustic target recognition. As an extension of DART, the DART-MT model integrates the mean teacher semi-supervised framework while inheriting DART’s core advantages. DART’s proven feature-extraction capabilities provide a robust foundation for DART-MT, with the added semi-supervised components designed to leverage unlabeled data and further enhance generalization. By optimizing feature learning through the mean teacher framework, DART-MT is expected to build on DART’s strengths, and positive outcomes from DART’s ablation experiments are anticipated to validate DART-MT’s effectiveness in real-world underwater acoustic scenarios.
To assess the characterization capability of the proposed feature-extraction methods for raw underwater acoustic signals, Table 7 compares performance across multiple approaches on the ShipsEar dataset: the original 2D features, the corresponding 3D features, and the 3D feature-fusion method of the DART model described in Section 3.4.
Experimental results demonstrate the TriFusion block’s superiority over traditional feature extraction methods across multiple deep learning architectures. When integrated with ResNet variants, TriFusion achieves the highest accuracy: 96.43% for ResNet18, 96.22% for ResNet34, 95.30% for ResNet50, and 94.81% for ResNet101—outperforming MFCC, 3D_MFCC, FBank, and CQT. This highlights TriFusion’s exceptional capability to extract discriminative features from underwater sonar signals.
Similar trends are observed in the EfficientNet series. For example, EfficientNet-B0 achieves 97.10% accuracy with TriFusion, compared to 95.74% for FBank (the second-best performer), with consistent improvements across B1, B2, and B3 variants. This indicates TriFusion’s effectiveness in enhancing feature representation for lightweight architectures.
Notably, TriFusion demonstrates remarkable synergy with the DART model, boosting accuracy to 98.66%—a significant improvement over single features. Even in DenseNet121, TriFusion achieves 97.99% accuracy, outperforming all other methods. These results underscore TriFusion’s versatility in optimizing diverse network architectures, enabling comprehensive capture of acoustic signatures and driving performance gains in underwater acoustic target recognition (UATR).
Compared with single features (e.g., MFCC, FBank, CQT) and simple extensions (e.g., 3D_MFCC), TriFusion’s fused feature design offers distinct advantages. By integrating multi-dimensional information from complementary signal domains, it provides richer, more comprehensive feature representations, thereby significantly enhancing model accuracy in UATR tasks.
While accuracy data enables preliminary assessments, deeper analysis of classification performance requires additional metrics. Confusion matrices and t-SNE plots for ResNet18, EfficientNet-B0, DenseNet121, and DART offer critical insights:
Confusion matrices visually identify misclassification patterns across target categories, exposing weaknesses in feature recognition and decision-making processes.
t-SNE plots project high-dimensional feature vectors into low-dimensional space, enabling visual evaluation of inter-class separability and feature discriminability.
Together, these analyses complement tabular data by providing multi-faceted perspectives on model behavior, facilitating comprehensive and rigorous performance evaluation in UATR tasks.
In our research on underwater acoustic target recognition, we conduct an in-depth analysis (Figure 7) of the performance of different models (ResNet18, EfficientNet_B0, DenseNet121, DART) on the ShipsEar training dataset through confusion matrices and t-SNE plots, with a particular focus on the superiority of the TriFusion block feature in the DART model.
From the vantage point of the confusion matrices, the DART model exhibits remarkable classification performance. For example, the confusion matrix of ResNet18 (Figure 7a) shows that, while it achieves a certain number of correct classifications for some classes, instances of misclassification remain; certain classes are confused, resulting in inaccurate predictions for some samples. Similar behavior is evident in the confusion matrices of EfficientNet_B0 and DenseNet121 (Figure 7c,e), with varying degrees of classification errors. In contrast, the confusion matrix of the DART model (Figure 7g) shows higher accuracy. Notably, when the TriFusion block feature is employed, the diagonal elements have consistently high values, implying a larger number of correctly classified samples, as the true and predicted classes align closely for each category and misclassifications are significantly reduced. Evidently, the TriFusion block feature offers a more accurate classification foundation for the DART model.
The t-SNE plots further reveal the DART model's advantage in feature separability. In the t-SNE plots of ResNet18, EfficientNet_B0, and DenseNet121 (Figure 7b,d,f), a certain degree of overlap among different classes can be observed, indicating that the features extracted by these models struggle to differentiate between classes and thus exhibit insufficient separability. Conversely, in the DART model's t-SNE plot (Figure 7h), the TriFusion block feature leads to a more distinct separation of points belonging to different classes: each class's distribution is more concentrated, with well-defined boundaries. Visually, this demonstrates the TriFusion block feature's stronger discriminative power, enabling the DART model to more effectively distinguish the various underwater acoustic target classes within the high-dimensional feature space.
To further validate the effectiveness of the TriFusion module in capturing multi-scale acoustic features, we designed a set of quantitative experiments to analyze the contributions of individual features (MFCC from original signals, CQT from differential signals, and FBank from cumulative signals) and their fused representations. Existing studies have shown that single-modality features (such as MFCC for spectral envelopes from original signals, CQT for transient time–frequency analysis from differential signals, and FBank for low-frequency energy characterization from cumulative signals) have limited representational capabilities in complex marine environments, as marine acoustic signals exhibit multi-dimensional coupling characteristics in time-domain dynamics, frequency-domain structures, and energy distributions.
In this experiment, we first extracted MFCC features from original acoustic signals after frame processing, conducted CQT time–frequency analysis on the envelopes of first-order differential signals extracted via Hilbert transform, and extracted FBank features from cumulative signals after wavelet decomposition. These three types of features were then stacked along the channel dimension into a 3 × 128 × 216 tensor to achieve complementary information modeling of time-domain transients, frequency-domain structures, and energy distributions. Based on the DART-MT model architecture and using a semi-supervised learning setup with only 10% labeled data, we compared the performance of single-feature models (MFCC-only from original signals, CQT-only from differential signals, FBank-only from cumulative signals) against the TriFusion module on the ShipsEar dataset. Evaluation metrics included accuracy, F1-scores under class imbalance, and recall rate differences for typical targets.
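To make this construction concrete, the sketch below illustrates one plausible way to build the three-channel stack; it is a minimal illustration under stated assumptions (the librosa/pywt/scipy calls, sampling rate, wavelet choice, and frame parameters are ours), not the authors' implementation.

```python
# Hedged sketch of a TriFusion-style feature stack; parameter values are assumptions.
import numpy as np
import librosa
import pywt
from scipy.signal import hilbert

def trifusion_features(signal, sr=52734, n_bins=128, frames=216):
    # (1) MFCC-type features from the original signal
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_bins)

    # (2) CQT of the Hilbert envelope of the first-order difference signal
    diff = np.diff(signal, prepend=signal[0])
    envelope = np.abs(hilbert(diff))
    cqt = np.abs(librosa.cqt(envelope, sr=sr, n_bins=n_bins, bins_per_octave=24))

    # (3) FBank (log-mel) features from the cumulative signal after wavelet processing
    cumulative = np.cumsum(signal)
    coeffs = pywt.wavedec(cumulative, "db4", level=3)
    cumulative = pywt.waverec(coeffs, "db4")[: len(signal)]
    fbank = librosa.power_to_db(
        librosa.feature.melspectrogram(y=cumulative, sr=sr, n_mels=n_bins))

    def fit(x):  # crop/pad each map to a common (n_bins, frames) grid
        x = x[:n_bins, :frames]
        return np.pad(x, ((0, n_bins - x.shape[0]), (0, frames - x.shape[1])))

    # stack along the channel axis -> tensor of shape (3, 128, 216)
    return np.stack([fit(mfcc), fit(cqt), fit(fbank)], axis=0)

x = np.random.randn(5 * 52734)    # stand-in for a 5 s ShipsEar segment
features = trifusion_features(x)  # (3, 128, 216)
```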
Combined with the quantitative analysis of the TriFusion module and the single-feature models in Table 8, Figure 8 further presents the confusion matrices of the four feature configurations (TriFusion block, MFCC-only, CQT-only, FBank-only), intuitively demonstrating the prediction distribution and error patterns of each model across the five sample classes (Class A to Class E).
In terms of overall performance, the TriFusion module achieves an average precision, recall, and F1-score of 94.45%, 95.15%, and 94.78%, respectively, significantly outperforming the single-feature models FBank-only (85.99%, 89.63%, 87.71%), MFCC-only (82.39%, 83.24%, 82.53%), and CQT-only (62.54%, 66.81%, 63.63%). This indicates that the feature-fusion strategy effectively integrates the advantages of the different features and is particularly superior in difficult categories such as Class B, where TriFusion achieves an F1-score of 88.14%, compared with only 45.35% for CQT-only and 74.01% and 76.19% for MFCC-only and FBank-only, respectively.
In-depth analysis reveals that MFCC's lower precision in Class A and Class B may be attributed to insufficient capture of temporal features or limited sensitivity to timbre changes; CQT's discriminative ability of only 35.00% in Class B highlights its over-reliance on rhythm/pitch features, leading to weak generalization; and FBank's combination of high recall and low precision in Class B indicates poor discrimination at category boundaries. By contrast, TriFusion forms a multi-dimensional representation by fusing MFCC's spectral envelope, CQT's pitch features, and FBank's auditory-perception features, achieving F1-scores exceeding 88% across all categories and significantly reducing the bias of single features, especially in difficult categories.
The confusion matrices in Figure 8 visually corroborate the above conclusions: TriFusion's confusion matrix exhibits a higher proportion of diagonal elements, with significantly more correct predictions for each category and a more uniform distribution of off-diagonal errors. In contrast, the confusion matrix of CQT-only shows numerous off-diagonal entries in Class B, reflecting frequent misclassification of other categories as Class B, while MFCC-only and FBank-only display obvious misclassification tendencies due to false negatives and false positives, respectively. This visual analysis concretely demonstrates the practical value of multi-feature fusion in reducing classification bias and improving the accuracy of complex-category recognition, providing solid evidence for the application of TriFusion in real-world scenarios.
The underlying reason for this superiority lies in the unique characteristics of the TriFusion block feature. By integrating multiple types of feature information, it can capture the features of underwater acoustic signals more comprehensively and precisely. Compared to the features utilized by other models, the TriFusion block feature can supply the DART model with richer and more representative information, thereby enhancing the model’s performance in classification and feature extraction.
4.5.2. Module Ablation Experiment
To conduct a detailed analysis of the functionality and efficiency of DART-MT, we performed ablation studies on different submodules in DART-MT.
The variant models of DART-MT consist of the following structures; the notations S1–S3 are used for brevity.
DART-MT (w/o ResNeXt18), denoted S1, removes the ResNeXt18 feature-extraction operation in the local feature-extraction part and replaces it with a standard convolutional layer for essential feature extraction.
DART-MT (w/o New Transformer Encoder), denoted S2, removes the multi-head self-attention and related sequence-processing operations of the New Transformer Encoder in all corresponding tasks.
DART-MT (w/o CBAM), denoted S3, removes the channel and spatial attention mechanisms of the Convolutional Block Attention Module (CBAM).
DART-MT denotes the complete architecture.
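For readability, these ablation variants can be summarized as configuration switches; the sketch below is purely illustrative, and the class and flag names are placeholders rather than the actual DART-MT code.

```python
# Hypothetical configuration sketch of the ablation variants S1-S3 described above.
from dataclasses import dataclass

@dataclass
class DartMTConfig:
    use_resnext18: bool = True        # local feature extractor (else a plain conv layer)
    use_new_transformer: bool = True  # multi-head self-attention encoder
    use_cbam: bool = True             # channel + spatial attention

VARIANTS = {
    "S1 (w/o ResNeXt18)":               DartMTConfig(use_resnext18=False),
    "S2 (w/o New Transformer Encoder)":  DartMTConfig(use_new_transformer=False),
    "S3 (w/o CBAM)":                     DartMTConfig(use_cbam=False),
    "DART-MT (full)":                    DartMTConfig(),
}
```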
Table 9 presents ablation study results using precision, recall, and F1-score as evaluation metrics for training and testing on the ShipsEar dataset, with the full DART-MT architecture serving as the baseline. Key findings reveal that each submodule contributes uniquely to model performance:
ResNeXt18, as the foundational feature extractor employing grouped convolutions, is critical for hierarchical feature representation and fine-grained detail capture. Removing ResNeXt18 (DART-MT → S1) caused significant performance degradation: precision for Class B dropped from 1.00 to 0.72, recall for Class A fell from 0.97 to 0.83, and the average F1-score decreased from 0.9619 to 0.8842. Its absence weakens feature expressiveness, as it adaptively reconstructs information, learns feature interactions, and filters noise from local feature perspectives, enabling subsequent CBAM and New Transformer Encoder modules to access rich input features.
The New Transformer Encoder plays a vital role in capturing global and local sequence context, enhancing holistic semantic understanding. Without it (DART-MT → S2), the F1-score for Class C dropped from 0.96 to 0.93, and the average F1-score fell to 0.9084. While performance declines were less drastic than ResNeXt18’s removal, the module’s ability to model sequential dependencies and contextual relationships remains essential for scenarios where feature order or global structure matters, complementing ResNeXt18’s local feature extraction.
The CBAM attention module optimizes feature discrimination by adaptively weighting channels and spatial locations. In Class E of ShipsEar, CBAM removal (DART-MT → S3) led to a decrease in precision/recall from 1.00 to 0.96, with the average F1-score dropping to 0.9062. By emphasizing task-relevant features and suppressing noise, CBAM enhances representation quality for challenging or imbalanced categories, demonstrating its importance in refining feature saliency and model focus.
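For reference, a compact, generic CBAM block (channel attention followed by spatial attention) can be sketched as follows; the reduction ratio and spatial kernel size are common defaults and not necessarily the settings used in DART-MT.

```python
# Generic CBAM sketch: channel attention, then spatial attention.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        # channel attention from global average- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        ca = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        x = x * ca
        # spatial attention from channel-wise mean and max maps
        sa = torch.sigmoid(self.spatial(
            torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)))
        return x * sa
```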
In summary, ResNeXt18, New Transformer Encoder, and CBAM form a synergistic architecture: ResNeXt18 excels in local feature discrimination, the New Transformer Encoder enriches global semantic modeling, and CBAM sharpens feature relevance. Their integration enables DART-MT to achieve superior accuracy and expressiveness by addressing multi-scale feature representation, contextual dependencies, and feature saliency simultaneously.
4.6. Discussion of In-Depth Studies (RQ3)
4.6.1. Analysis of Model Performance with Pre-Trained Weights in UATR Task
Although there is a large domain gap between the pre-trained data and the target UATR task, models such as DART-MT with pre-trained weights still show outstanding performance and superiority. This not only highlights the importance of pre-trained weights in improving model accuracy but also suggests that with appropriate pre-training strategies, models can better adapt to tasks with domain gaps and achieve better results.
Comparing Figure 9 and Figure 10 reveals that, at a 1% label ratio, data imbalance causes unrecognized categories in models without pre-trained weights, leading to a zero-shot dilemma. Pre-trained weights eliminate this issue: without them, dataset imbalance degrades model accuracy by biasing learning toward overrepresented categories, leaving underrepresented ones poorly recognized.
Experiments on the ShipsEar dataset with 1%, 5%, and 10% label ratios show the DART model consistently outperforms ResNet18, ViT, DenseNet121, and EfficientNetB0 in recognition accuracy. Adding pre-trained weights to any model not only improves accuracy but also resolves zero-shot issues from sample scarcity. The DART-MT model with pre-trained weights further enhances accuracy compared to the standalone DART model, highlighting its superior effectiveness.
Pre-trained weights mitigate data imbalance by providing models with prior knowledge from large-scale, balanced datasets. This enriches feature representations and optimizes initialization parameters, enabling accurate recognition of underrepresented categories. In semi-supervised UATR with imbalanced samples, pre-trained weights address zero-shot challenges, demonstrating the framework’s transferability.
In summary, DART-MT with pre-trained weights surpasses ViT, ResNet18, DenseNet121, and EfficientNetB0 in recognition accuracy. Its ability to leverage pre-trained knowledge not only resolves small-sample imbalance issues but also exhibits strong transfer learning capabilities, effectively transferring knowledge to new tasks and enhancing recognition performance.
4.6.2. Improvement and Feature Visualization of Models with Limited Labeled Samples
The experimental findings demonstrate the efficacy of the proposed learning framework in enhancing model performance across varying conditions of limited labeled samples. The model's reliance on the quantity of labeled samples is significantly diminished. Furthermore, the learning architecture proposed in this study enhances model performance even in scenarios with ample training samples. The experimental results demonstrate that the learning framework can extract information beyond labels, including the essential characteristics and representations of underwater acoustic data.
Additionally, we employed the t-distributed stochastic neighbor embedding (t-SNE) technique to demonstrate the enhanced recognition performance and feature separability of models trained with fewer labeled samples. We randomly selected samples from each category of the test dataset, with the model trained on a dataset comprising 10% labeled samples. The learned deep features, represented by the pre-fully-connected-layer outputs, are shown in Figure 11. A visual comparison of the deep features reveals that, within the 10% training dataset, ViT-MT exhibits substantial overlap in its feature space. ResNet18-MT, DenseNet121-MT, and EfficientNetB0-MT display reduced overlap relative to ViT-MT, while DART-MT shows even fewer overlapping points and improved category separability, as demonstrated by its output features.
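The visualization itself follows the standard t-SNE recipe; in the sketch below, the feature matrix and labels are random stand-ins for the pre-fully-connected-layer outputs and class indices.

```python
# Sketch of the t-SNE feature visualization with stand-in data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(500, 512)        # stand-in for extracted deep features
labels = np.random.randint(0, 5, size=500)  # stand-in for the five ShipsEar classes

embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of pre-FC deep features (10% labels)")
plt.show()
```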
4.6.3. Robustness Analysis and Verification of the Model
In the realm of semi-supervised learning, particularly within the research on the mean teacher framework, two critical issues demand in-depth investigation. First, the mean teacher framework exhibits a strong dependence on unlabeled data, yet the precise influence of variations in data quality and quantity on model performance remains incompletely understood. Given that low-quality data, such as those corrupted by noise, can impede the semi-supervised learning process, elucidating the model’s sensitivity to these data quality changes is of paramount importance. Second, in practical application scenarios—such as underwater environments, which are inherently noisy—existing studies lack comprehensive analyses of model robustness to noise. Understanding how models handle noise is essential for their effective deployment in real-world settings.
To address these research gaps, we employed the ShipsEar dataset with 50% data labeled, systematically introducing varying levels of Gaussian white noise (from −20 dB to 20 dB) to evaluate model performance. This choice of Gaussian white noise—serving as a foundational signal processing benchmark and proxy for underwater thermal noise—enables standardized assessment of noise intensity impacts, though it acknowledges a critical limitation: distinct noise types (e.g., impulsive, pink) may differently affect model predictions, warranting future studies to validate generalizability across diverse noise distributions.
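A minimal sketch of this noise-injection procedure, assuming the SNR is defined on average signal power, is given below; the clip length and sampling rate are stand-ins.

```python
# Additive Gaussian white noise at a target SNR (in dB).
import numpy as np

def add_awgn(signal, snr_db, rng=np.random.default_rng(0)):
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

clean_clip = np.random.randn(5 * 20000)  # stand-in for a 5 s clip
for snr in (-20, -10, 0, 10, 20):
    noisy = add_awgn(clean_clip, snr)
```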
Table 10 presents key performance metrics (accuracy, precision, recall, F1-score) for the semi-supervised models ResNet18-MT, DART-MT, DenseNet-MT, EfficientNetB0-MT, and ViT-MT under varying noise levels. These results establish a foundation for analyzing model sensitivity to unlabeled-data quality and noise resilience.
At −20 dB noise, DART-MT achieved 89.02% accuracy, outperforming ResNet18-MT in precision (89.48%, +0.67), recall (89.09%, +0.62), and F1-score (89.28%, +0.64), as well as DenseNet-MT and EfficientNetB0-MT. At −10 dB, DART-MT's accuracy (80.75%) exceeded ResNet18-MT by 2.07%, with superior recall (80.74%, +1.89) and F1-score (81.86%, +0.83). At 0 dB, DART-MT's accuracy (90.40%) and all other metrics outperformed competitors, and at 10 dB and 20 dB it maintained consistent superiority. From −20 dB to 20 dB, DART-MT demonstrated exceptional robustness, with accuracy fluctuations of only 1.57%, far surpassing ViT-MT and the other baselines. In underwater scenarios, DART-MT's ability to mitigate noise-induced degradation and stabilize predictions underscores its superiority in handling low-quality unlabeled data within the mean teacher framework.
Although the model has been tested under different noise levels, the impacts of other noise types, such as impulsive and pink noise, may differ significantly from that of the Gaussian white noise used here. This study used only Gaussian white noise for testing; while this provides a standardized benchmark for evaluating the effect of noise intensity, it cannot fully reflect the model's robustness in complex real-world environments. For instance, impulsive noise is sudden and high-energy and may cause misjudgments, while pink noise has an uneven energy distribution across frequency bands, so its interference mechanism differs entirely from that of Gaussian white noise. Future research could therefore introduce multiple noise types to systematically explore the model's response under various noise distributions, enabling a more accurate assessment of its adaptability to practical applications.
Overall, while our findings validate DART-MT’s robustness to Gaussian white noise, the unaddressed impact of diverse noise types highlights a critical research avenue. Explicitly characterizing noise parameters is imperative for both methodological rigor and practical deployment in noisy environments, positioning DART-MT as a robust choice for semi-supervised learning in complex acoustic settings.
4.6.4. Analysis of the Impact of Different Loss Functions on the Model
As shown in Table 3, there is a significant imbalance among the five categories of the ShipsEar dataset. To address this problem, a class-balanced loss function can be considered; for example, a dynamic class-balanced loss that adjusts its weights based on the changing characteristics of the data enables the model to adapt to an evolving category distribution and thereby provide more accurate predictions. Compared with other methods, the class-balanced loss function has unique advantages in dealing with such imbalance: by reasonably allocating weights to different categories, it ensures that the model pays sufficient attention to minority categories and improves their recognition accuracy. Unlike adaptive sampling, which must balance oversampling against undersampling and may introduce noise, or data augmentation, which may cause the model to overfit minority categories when they are enhanced excessively, the class-balanced loss addresses the imbalance directly through its weighting scheme.
To study the performance of different loss functions in solving the zero-shot problem and handling the sample imbalance in the ShipsEar dataset, we conducted a comparative experiment; the results are listed in Table 11. The experiment used 10% of the labeled data from the ShipsEar dataset. By constructing two new models, DART-MT-CE (using the cross-entropy loss) and DART-MT-FL (using the focal loss), and comparing them with the original DART-MT model (using CB Loss), we can clearly see the differences between the loss functions in handling the sample-imbalance problem. In the data-preprocessing stage, the selected 10% labeled ShipsEar data were processed at the standard audio sampling rate and segmented into 5 s segments. All models were trained according to the training settings of the DART-MT model in the original experiment, keeping parameters such as the momentum, optimizer, number of training epochs, learning-rate schedule, and batch size unchanged. During training, the loss value, accuracy, and other indicators of each model were recorded. After training, the same test set was used to evaluate the three models on classification accuracy, recall, and F1-score. This experiment further verified the effectiveness and advantages of the class-balanced loss function for handling the sample imbalance of the ShipsEar dataset.
By comparing the DART-MT models using different loss functions, we found that DART-MT-CE (cross-entropy loss) performed the worst on all indicators: although its precision was relatively high, its recall was low, resulting in a relatively low F1-score, making it suitable only for settings where high precision is required but recall is less critical. DART-MT-FL (focal loss) showed improvements in all indicators, especially recall and F1-score, indicating that the focal loss has certain advantages in addressing sample imbalance. However, DART-MT (class-balanced loss, CB Loss) performed best on all evaluation indicators, with the highest accuracy, precision, recall, and F1-score. This demonstrates the significant advantage of CB Loss in handling zero-shot learning and sample imbalance. In practical applications such as underwater acoustic target recognition, where sample imbalance and zero-shot learning are common challenges, CB Loss can better cope with these challenges and improve the performance and generalization ability of the model.
The exploration of different loss functions in this study provides crucial insights into the effectiveness of various loss functions in addressing the complex challenges of sample imbalance and zero-shot learning. The superiority of the CB Loss in the ShipsEar dataset can be attributed to several key factors. First, the CB Loss function is designed to handle the imbalanced distribution of classes by intelligently assigning appropriate weights to different classes. This mechanism ensures that the model pays more attention to minority classes, which are often overlooked in the presence of dominant classes. By doing so, it significantly improves the recognition accuracy of these underrepresented classes.
In the ShipsEar dataset, classes with fewer samples would receive higher weights, compelling the model to learn their unique features more comprehensively. This targeted approach helps balance the learning process and prevents the model from being overly influenced by majority classes. Moreover, CB Loss considers the specific nature of the data distribution and adjusts its weighting scheme to adapt to the specific challenges posed by the dataset. This adaptability makes it highly effective in handling the complex and diverse scenarios encountered in real-world applications.
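One common instantiation of such class-balanced weighting is the "effective number of samples" formulation of Cui et al.; the sketch below illustrates this idea, with the beta value and per-class counts as illustrative assumptions rather than the paper's exact settings.

```python
# Illustrative class-balanced cross-entropy sketch.
import torch
import torch.nn.functional as F

def cb_cross_entropy(logits, targets, samples_per_class, beta=0.999):
    effective_num = 1.0 - torch.pow(beta, samples_per_class.float())
    weights = (1.0 - beta) / effective_num           # larger weight for rarer classes
    weights = weights / weights.sum() * len(samples_per_class)  # normalize
    return F.cross_entropy(logits, targets, weight=weights)

# Example: five ShipsEar classes with imbalanced counts (illustrative numbers)
counts = torch.tensor([120, 45, 300, 80, 60])
logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
loss = cb_cross_entropy(logits, labels, counts)
```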
4.6.5. Analysis of the Computational Complexity and Parameter Quantity of the Model
FLOPs and the number of parameters are vital in model research and evaluation. FLOPs indicate a model's computational complexity, helping to assess operational efficiency on different hardware platforms and to select a suitable deployment environment; for example, in highly real-time underwater acoustic target monitoring, lower FLOPs can improve response speed. The number of parameters is linked to a model's complexity and expressiveness: while many parameters can capture complex data, they may cause overfitting when data are limited. Studying the parameter count aids in understanding the balance between learning capacity and generalization, and analyzing the parameters of various models offers a basis for model selection and improvement. We therefore experimented with different models, and the results are shown in Figure 12.
The UAPT [51] provides an important foundation and direction for our experiments. Research in underwater acoustic target recognition (UATR) emphasizes the Transformer model's advantages: although it may have more parameters and FLOPs in some cases, this does not undermine its value in marine sonar recognition.
Regarding the number of parameters, a large quantity endows the model with stronger expressiveness. In marine sonar recognition, underwater acoustic signals are complex and diverse, containing various target features and noise. Adequate parameters allow the Transformer model to learn these patterns and better distinguish targets.
DART-MT has distinctive characteristics in terms of parameters and FLOPs. Its parameter count is 116.73 M, far exceeding ResNet18-MT, ViT-MT, EfficientNet_b0-MT, and DenseNet-MT. Despite limited data, this large capacity gives it strong expressiveness and the ability to learn complex patterns, acting as a rich "knowledge store" within a small information space, while the semi-supervised mean teacher and pre-training transfer methods enhance its generalization and adaptability.
In FLOPs, DART-MT requires 1.23 G, lower than the other models, so it needs fewer computing resources for inference, which is especially useful in resource-limited settings such as mobile or edge devices.
Overall, DART-MT's parameter capacity helps it handle complex patterns, and its favorable FLOPs count balances resource use in training and inference. This may allow it to outperform other models in scenarios demanding both high capacity and efficiency, although its superiority also depends on factors such as architecture, data, training methods, and application scenario.
4.6.6. Generalizability Analysis of the Model
In the current field of model research, evaluating the performance of different models is vital. To explore the performance of the various models in practical applications, we conducted a comprehensive comparative experiment focusing on several key indicators: accuracy, training time, prediction time, and model support. By testing multiple mainstream models, we obtained a set of data with significant reference value, as listed in Table 12.
The DART model outperforms state-of-the-art architectures in multiple key metrics. In terms of classification accuracy, it achieved 98.57 ± 0.031, significantly surpassing ResNet18, Vision Transformer (ViT), DenseNet121, and EfficientNetB0. This high accuracy indicates that DART excels at data classification and prediction tasks, enabling precise target identification. Its performance in fully supervised generalization verification further demonstrates excellent adaptability to diverse datasets, minimizing misclassification errors.
Regarding training efficiency, DART completed training in only 3.100 h, a notable reduction compared to ViT (7.305 h) and DenseNet121 (4.892 h), and shorter than EfficientNetB0 (3.452 h). This shorter training duration accelerates model development, enhancing iteration speed while reducing time and computational costs.
In prediction speed, DART’s 47 s prediction time provides a significant edge over ViT (139 s), DenseNet121 (135 s), EfficientNetB0 (91 s), and ResNet18 (56 s). This rapid prediction capability is crucial for real-time data processing applications, as it enables faster system response and improves overall availability, thus enhancing practical utility.
Given the variability of model performance across different domains, especially on real-world underwater acoustic datasets, investigating the transferability of DART-MT to other datasets is essential. We therefore applied the DeepShip dataset and conducted experiments following the procedures described in Section 4.2 and Section 4.3. The detailed experimental results are summarized in Table 13.
At a 1% sample label ratio, DART-MT demonstrated a remarkable accuracy of 80.6 ± 0.005%. In contrast, ResNet18-MT achieved an accuracy of 73.2 ± 0.007%, ViT-MT 66.4 ± 0.010%, DenseNet121-MT 69.3 ± 0.008%, and EfficientNetB0-MT 75.2 ± 0.013%. These differences are not only numerically large but also indicate DART-MT's superior ability to learn from a minimal amount of labeled data. The asterisks (*) in the table denote that these improvements are statistically significant (p < 0.05), as determined by paired t-tests against the runner-up results. This statistical significance implies that the observed differences in accuracy are not due to random chance but reflect the inherent superiority of DART-MT in this low-label scenario.
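For reference, such a paired significance check can be computed as in the sketch below; the per-run accuracy arrays are illustrative stand-ins, not the reported results.

```python
# Paired t-test between per-run accuracies of two models (illustrative data).
import numpy as np
from scipy import stats

dart_mt_runs   = np.array([0.803, 0.806, 0.809, 0.805, 0.807])  # per-run accuracy
runner_up_runs = np.array([0.749, 0.755, 0.751, 0.753, 0.752])

t_stat, p_value = stats.ttest_rel(dart_mt_runs, runner_up_runs)
significant = p_value < 0.05  # marked with an asterisk in the table
```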
As the sample label ratio increased to 5%, DART-MT's accuracy rose to 88.2 ± 0.005%, again outperforming the other models: ResNet18-MT reached 82.2 ± 0.004%, ViT-MT 73.4 ± 0.008%, DenseNet121-MT 77.3 ± 0.008%, and EfficientNetB0-MT 82.9 ± 0.013%. This continued dominance of DART-MT further validates its effectiveness in leveraging additional labeled data for performance improvement.
When the label ratio reached 10%, DART-MT maintained its lead with an accuracy of 96.2 ± 0.005%, while ResNet18-MT achieved 88.9 ± 0.007%, ViT-MT 80.2 ± 0.009%, DenseNet121-MT 82.2 ± 0.010%, and EfficientNetB0-MT 89.6 ± 0.013%. The significant gap between DART-MT and the other models at this stage indicates that DART-MT better utilizes the increased labeled data to refine its classification capability.
At 50% and 90% sample label ratios, DART-MT's performance continued to excel: its accuracy was 97.8 ± 0.005% at 50% and 98.8 ± 0.004% at 90%. The other models, while also improving, lagged far behind; for example, at the 90% label ratio, ResNet18-MT achieved 92.1 ± 0.008%, ViT-MT 87.9 ± 0.010%, DenseNet121-MT 87.8 ± 0.010%, and EfficientNetB0-MT 93.1 ± 0.013%.
In conclusion, the data in Table 13 clearly demonstrate DART-MT's robust generalization ability. Its consistently high accuracy across different sample label ratios, especially at lower ratios where data scarcity poses a significant challenge, showcases its effectiveness in handling limited-data scenarios. This superiority over other models in recognizing underwater acoustic targets on the DeepShip dataset positions DART-MT as a promising approach in the field of underwater acoustic target recognition.
Meanwhile, in the literature review we found that UART [56] can provide favorable initialization for models through pre-training, reduce the risk of falling into local optima, and demonstrate excellent few-shot classification capability. We therefore compared the few-shot models with the semi-supervised DART-MT. Since semi-supervised tasks do not require fully labeled data, Table 14 omits the 100% sample setting. To verify the performance of DART-MT, we compared it with several few-shot models in the table and conducted experiments with labeled-data proportions of 1%, 10%, and 50%. These experiments not only demonstrate the adaptability of DART-MT to semi-supervised scenarios but also reveal its unique gains in model accuracy, fully demonstrating its innovative value and superiority over models based on UART's few-shot capabilities.
In the experiments on the DeepShip dataset, the comparison results between DART-MT and few-shot models fully demonstrate its significant advantages. When the labeled data accounts for only 1%, the accuracy of DART-MT reaches 80.60%, far exceeding that of EncoderA pre-trained with UART (55.26%) and the UART model itself (52.67%), highlighting its outstanding ability to efficiently utilize limited labeled data in extremely data-scarce scenarios. As the proportion of labeled data increases to 10% and 50%, DART-MT still maintains the lead, with accuracies increasing to 96.20% and 97.80%, respectively, which are significantly higher than other models.
This indicates that DART-MT can not only avoid overfitting under low-labeled data but also possesses stronger generalization ability, effectively optimizing model performance by leveraging unlabeled data through semi-supervised learning. Meanwhile, the SCTD pre-trained weights adopted by DART-MT are highly compatible with the semi-supervised strategy, better matching the data distribution in underwater acoustic target recognition tasks. This successfully breaks through the performance bottleneck of few-shot models in data-scarce scenarios, providing a more optimal solution for practical applications.
4.6.7. Comparison Between Fully Supervised and DART-MT Models
From Table 15, the DART-MT model demonstrates outstanding performance in the semi-supervised learning framework, particularly when using different proportions of the training dataset. With only 10% of the training dataset, DART-MT achieves an accuracy of 94.86%, a precision of 94.45%, a recall of 95.15%, and an F1-score of 94.78%. These metrics outperform most fully supervised learning models, such as Yamnet (78.72% accuracy), VGGish (86.75% accuracy), and ADCNN (93.58% accuracy). Although models such as CA_MobilenetV2 (98.16% accuracy) and BS-MSF-FAM-scSE (98.40% accuracy) show higher performance with full training data, DART-MT's ability to combine a small amount of labeled data with abundant unlabeled data in the semi-supervised framework significantly enhances model generalization. This highlights DART-MT's effectiveness in optimizing recognition performance under limited labeled-data conditions.
Furthermore, the stability of DART-MT is particularly noteworthy. With a standard deviation of merely 0.003 for accuracy and 0.002 for the F1-Score when using 90% of the training dataset, the model exhibits highly consistent performance across experiments, demonstrating strong reliability. Such stability is especially critical in semi-supervised learning, as the inclusion of unlabeled data may introduce noise, and DART-MT effectively addresses this challenge.
In the field of ship radiated noise classification, different models exhibit diverse performances on various datasets. As can be seen from the experimental results of the ShipsEar dataset mentioned above, the DART-MT model demonstrates excellent performance within a semi-supervised learning framework. By efficiently leveraging a small amount of labeled data and a large volume of unlabeled data, it not only outperforms a host of advanced models when using a high proportion of training data but also surpasses most fully supervised learning models when the proportion of training data is low. Moreover, it has remarkable stability. In the research of ship radiated noise, the DeepShip dataset is also an important experimental benchmark. To further explore the generalization ability and adaptability of different models, it is necessary to conduct comparative experiments on the DeepShip dataset using the models that have shown outstanding performance on the ShipsEar dataset. This allows for a more comprehensive evaluation of the actual effectiveness of each model in the task of ship radiated noise classification.
Table 16 presents the performance comparison of various classifiers on the DeepShip dataset. Notably, the DART-MT model demonstrates remarkable superiority even when trained with only 10% of the training dataset. It achieves an accuracy of 96.20%, a precision of 96.23%, a recall of 96.18%, and an F1-score of 96.16%, significantly outperforming many well-known models.
For example, the commonly used Yamnet model only achieves an accuracy of 69.53%, with precision, recall, and F1-score all below 70%. VGGish performs even worse, with an accuracy of 66.85% and other metrics slightly lower. Compared with advanced architectures such as ADCNN (90.23% accuracy) and MobileNetV2 (90.18% accuracy), DART-MT shows a significant improvement of over 6 percentage points. Even when compared with high-performing models like VFR (93.80% accuracy), CA_MobileNetV2 (93.50% accuracy), and BAHTNet (94.57% accuracy), DART-MT surpasses them in overall performance.
This highlights DART-MT’s remarkable ability to effectively utilize limited labeled data in a semi-supervised learning context, demonstrating its strong generalization and adaptability. It serves as a highly competitive and effective solution for ship radiated noise classification on the DeepShip dataset.
In summary, the advantages of DART-MT in semi-supervised learning are multifaceted: it not only achieves performance comparable to or exceeding that of fully supervised learning models with limited labeled data but also further improves its performance as the amount of training data increases, ultimately reaching the highest level among all models. Additionally, its combined strengths in high accuracy, recall, and precision, along with low experimental standard deviations, further validate the effectiveness and robustness of DART-MT within a semi-supervised learning framework. This capability highlights the significant potential of DART-MT in practical applications, particularly in scenarios where labeled data is scarce, as it substantially reduces reliance on large amounts of labeled data while maintaining exceptional performance.
4.6.8. Research on Multi-Scenario Verification of Cross-Environment Adaptability of Marine Acoustic Models
Current research on marine acoustic models predominantly focuses on performance verification under specific environments, with insufficient discussion on adaptability across different sea areas and hydrological conditions (such as ocean currents, temperature, and salinity changes). The complexity of marine environments leads to significant differences in the spectral distribution and intensity dynamic range between inshore ship noise and offshore natural turbulent noise. To fill this research gap, this experiment is based on the ShipsEar dataset and introduces typical environmental noise collected from the South China Sea.
The noise was collected in an area with a water depth of approximately 4000 m, a nearly flat seabed, and sea state 1 (slight sea, wind force levels 1–3). An omnidirectional hydrophone with a sensitivity of −170 dB re 1 V/μPa and a frequency response covering 0.1 Hz to 80 kHz was deployed at a depth of 300 m, combined with a digital acquisition instrument sampling at 20 kHz for high-fidelity collection. The research vessel's engine was turned off during data collection to eliminate self-noise, ensuring that the raw recordings contain only natural marine signals such as wave sounds and seabed geological activity noise.
In the experiment, the signal-to-noise ratio (SNR) was systematically controlled to −15 dB, −10 dB, −5 dB, 0 dB, 5 dB, and 10 dB. After dividing the 5 min original noise into 5 s intervals, random noise segments were fused with original audio according to different SNR gradients. Multi-scenario augmented data covering strong noise to clean signals were constructed only in the training set, while the validation set retained original data to objectively evaluate model performance. By comparing the performance of models trained with different SNR augmented data on the original validation set and noise test set, the robustness of target recognition and signal restoration in strong noise scenarios were analyzed.
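A plausible sketch of this SNR-controlled mixing of recorded noise into training clips (our interpretation; array names, lengths, and the sampling rate are stand-ins) is:

```python
# Mix a random window of recorded sea noise into a clip at a target SNR.
import numpy as np

rng = np.random.default_rng(0)
sea_noise = rng.standard_normal(5 * 60 * 20000)  # stand-in for 5 min of noise at 20 kHz
train_clip = rng.standard_normal(5 * 20000)      # stand-in for a 5 s training clip

def mix_at_snr(clip, noise, snr_db):
    start = rng.integers(0, len(noise) - len(clip))
    seg = noise[start:start + len(clip)]
    scale = np.sqrt(np.mean(clip ** 2) / (np.mean(seg ** 2) * 10 ** (snr_db / 10.0)))
    return clip + scale * seg

# training-set augmentation over the SNR grid used in the experiment
for snr in (-15, -10, -5, 0, 5, 10):
    augmented = mix_at_snr(train_clip, sea_noise, snr)
```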
This study accurately simulates noise changes under typical water depths and hydrological conditions in the South China Sea, which not only systematically verifies the model’s signal extraction capability in deep-sea strong noise environments (such as −15 dB) and recognition effect in inshore complex environments (0 dB to 10 dB) with mixed ship noise and biological signals but also provides key technical support for practical applications of marine acoustic models in cross-sea area acoustic monitoring and underwater target recognition.
Table 17 shows that the DART-MT model exhibits consistent performance trends in South China Sea noise environments with different signal-to-noise ratios (SNRs). When the SNR increases from −15 dB to 5 dB, the model's accuracy rises from 91.49% to 93.53%, with the F1-score peaking at 93.62% at 5 dB, indicating the most balanced signal extraction and recognition under moderate noise. However, when the SNR further increases to 10 dB, the accuracy drops to 92.39%, possibly due to differences between the noise spectral characteristics and the training-data distribution at high SNRs, leading to fluctuations in generalization. It is worth noting that the model still maintains an accuracy of over 91% in the strong-noise environment of −15 dB, and the standard deviation of all indicators is less than 0.02, verifying its stability under extreme noise. In the noise-free scenario, the model's accuracy reaches 94.86%, only 1.33 percentage points higher than in the 5 dB scenario, indicating strong noise resistance.
To deeply analyze the model's classification performance across categories and noise conditions, Figure 13 presents the confusion matrices of the DART-MT model under the various signal-to-noise ratio (SNR) scenarios. In contrast to the macro statistics of overall accuracy, precision, recall, and F1-score in the previous table, these matrices offer a fine-grained view of the model's per-category classification details, clearly revealing which acoustic signal categories are more prone to misjudgment in specific noise environments.
Although the current experiment covers a wide range of SNRs, there are still limitations in the research on marine environmental adaptability. On the one hand, it only simulates environmental changes by controlling noise intensity, without involving the impact of different sea areas (such as inshore and offshore), seabed topography (reef areas and plains), or hydrological parameters (ocean currents, salinity) on noise characteristics. For example, the spectral differences between low-frequency turbulent noise in the deep South China Sea and high-frequency ship noise in inshore areas are not distinguished. On the other hand, the noise source is single, lacking the simulation of mixed scenarios of multi-source noise such as biological sonar and industrial noise, making it difficult to fully reflect the complexity of the real marine environment.
Future research can be expanded from multiple dimensions: introducing noise data from different sea areas and water depths to build a cross-environment database, and analyzing the model’s adaptability in scenarios with significant spectral feature differences; combining the physical impact of parameters such as temperature and salinity on sound propagation, and simulating signal transmission under different hydrological conditions through data augmentation; designing multi-source noise mixing experiments, such as superimposing ship, biological, and geological activity noise in proportion, to evaluate the model’s target recognition ability in complex scenarios such as inshore ports and offshore fishing grounds. These studies will significantly enhance the model’s cross-environment adaptability in practical applications such as deep-sea detection and inshore monitoring, providing more transferable technical solutions for global marine acoustic monitoring networks.
4.7. Hyper-Parameter Analysis (RQ4)
This study compares the proposed training framework with typical ResNet, ViT, DenseNet121, and EfficientNetB0 models for underwater target recognition under the mean teacher learning framework. During the self-supervised learning phase, all samples were treated as unlabeled, without using label information. In the supervised fine-tuning phase, the model loads the previously trained weights and is fine-tuned on various sample sizes. The training parameters during supervised fine-tuning were aligned with those in the baseline experiment. Following supervised fine-tuning, training progressed to the semi-supervised fine-tuning phase with label ratios of 1%, 5%, 10%, 50%, and 90% of the training set. Finally, all models were trained on all unlabeled samples using unsupervised self-distillation. The training process ensures that the model accesses only the partial label information specified by the training dataset.
Table 18 shows that across all label ratios, the models employing the mean teacher learning strategy significantly outperformed those that did not. This suggests that the mean teacher training strategy effectively enhanced model performance. Under the mean teacher-training strategy, the student model boosts its generalization performance by jointly training with the teacher model through a blend of supervised and unsupervised learning. The teacher model created pseudo-labels for unlabeled data, which were then used to train the student model. Consistency constraints between student and teacher models allow for a better understanding of data characteristics and patterns, thus enhancing performance. Models leveraging the mean teacher strategy attained a higher accuracy across different label ratios. Notably, at lower label ratios, the DART-MT model showed a more significant performance improvement than the other models. Consequently, by integrating the mean teacher strategy, the DART-MT model makes better use of limited labeled data to enhance performance and allows the student model to leverage the teacher model knowledge for improved generalization.
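A minimal mean-teacher update step in the spirit of this strategy is sketched below; the EMA decay, consistency weight, and function names are assumptions rather than the exact DART-MT training code.

```python
# Mean-teacher training step: supervised loss + consistency loss + EMA teacher update.
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student, teacher, alpha=0.99):
    # teacher weights are an exponential moving average of the student weights
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)

def train_step(student, teacher, optimizer, x_labeled, y, x_unlabeled, lam=1.0):
    # supervised term on the small labeled batch
    sup_loss = F.cross_entropy(student(x_labeled), y)
    # consistency term: student should match the EMA teacher on unlabeled data
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x_unlabeled), dim=1)
    student_probs = F.softmax(student(x_unlabeled), dim=1)
    cons_loss = F.mse_loss(student_probs, teacher_probs)
    loss = sup_loss + lam * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_teacher(student, teacher)
    return loss.item()
```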
Figure 14 shows that model accuracy increases with the training label ratio, as models can learn from more labeled samples to optimize performance. For instance, the DART-MT model's accuracy rises from 0.800 at a 1% ratio to 0.8907 at 5%, indicating that severely limited labeled samples impede comprehensive learning and generalization, thus restricting performance. When the ratio rises from 10% to 50%, DART-MT's accuracy improves from 0.9486 to 0.9645, thanks to better utilization of labeled-sample information. From 50% to 90%, DART-MT's accuracy increases by 2.4 percentage points, from 0.9645 to 0.9885. This relatively modest rise might be because the model already has sufficient labeled data at higher ratios, where additional increases yield only marginal gains. Overall, the training label ratio significantly affects model accuracy: increasing it within a certain range enhances performance, but beyond that point improvements are minimal, while very low ratios constrain performance. Choosing an appropriate label ratio therefore requires balancing data-collection costs against potential performance gains.
Figure 15 illustrates that with 10% training labels, both training and validation losses decreased notably within the initial 12 epochs, suggesting effective feature and pattern learning by the model. Subsequently, both the training and validation losses stabilized without further decrease and fluctuated within a defined range. Observing the evolution of the training and validation losses reveals a relatively small gap between them, signifying the model’s strong generalization capability. Consequently, with such a small gap, the generalization ability of the model can be deemed acceptable.
Figure 16 presents the changes in model precision, recall, and accuracy with the number of training epochs. After about 10 epochs, the fluctuations of the three metric curves decreased and the curves stabilized; the model began to converge and its performance became stable. After convergence, the precision was nearly 0.95, with few misjudgments; the recall was about 0.95, with few positive examples missed; and the accuracy also remained high, indicating good classification of both positive and negative samples. The model could converge in about 10 epochs thanks to the good parameter-initialization starting point provided by the pre-trained weights, avoiding the long oscillations and convergence difficulties that random initialization might cause. Moreover, the precision, recall, and accuracy remained high after convergence, indicating that the pre-trained weights enabled the model to exploit the general features obtained from large-scale pre-training and capture the data patterns of the current task more quickly. Judging from the convergence speed and the final metrics, the pre-trained weights most likely played a positive role.