1. Introduction
Driven by the growing demand for sustainable energy, offshore renewable-energy projects are expanding rapidly. Their deployment and maintenance require reliable monitoring of the marine environment and, in particular, accurate underwater object detection for tasks such as subsea cable installation and the inspection of underwater foundations.
Deep learning has become a dominant paradigm for object detection and now underpins a wide range of applications, including surveillance [1,2], medical image analysis [3], autonomous driving, and agricultural monitoring [4,5]. In parallel, increasing ocean exploitation and exploration [6,7] have created a strong demand for accurate and timely underwater object detection and identification. These capabilities support safety-critical operations such as seabed resource development and underwater archaeological surveys [8,9], as well as search-and-recovery missions for sunken vessels, aircraft accidents, and missing persons [10,11].
Because acoustic waves propagate more effectively in water than electromagnetic and optical waves, sonar is widely used for underwater target detection and provides critical prior information for many underwater tasks [12,13,14]. Nevertheless, sonar-based detection is challenging in real environments. Sediment coverage, biological occlusions, and water flow can partially obscure targets; diverse noise sources distort echoes; and variations in temperature, salinity, and depth further degrade signal quality. Consequently, sonar images often contain many small objects with blurred boundaries and weak textures, making them easy to confuse with background clutter [15,16,17]. Moreover, collecting and annotating real sonar data is difficult and expensive, and publicly available datasets remain limited [12,18]. These factors, together with the domain gap between acoustic and optical imagery [19], hinder the direct transfer of mainstream deep-learning detectors and often lead to poor generalization across sonar sensors and operating conditions.
Many underwater applications impose strict real-time constraints, which motivates detectors with compact models and fast inference [20]. Such detectors can be deployed on resource-limited platforms (e.g., small vessels or embedded systems) to process sonar data on site and avoid latency caused by transmitting data to onshore facilities. In addition, rapid and reliable target localization and identification are essential for timely decision-making in operational scenarios [12,21]. Therefore, it is practically important to develop sonar-image detectors that achieve high accuracy for small objects while remaining robust to noise and efficient enough for real-time deployment across diverse sonar modalities.
Existing underwater sonar-image object detectors can be broadly categorized into two types. The first type is derived from object detectors originally designed for visible-light images [22]. Given the relative maturity of optical-image object detection, leveraging the extensive experience and technological advances from that domain can significantly accelerate the development of high-performance detectors for acoustic imagery [23,24]. For instance, Fan et al. improved the YOLOv4 network by streamlining its backbone to reduce model parameters and network depth, thereby meeting real-time requirements, and enhanced the PAN module to improve the detection accuracy of small targets [25]. Meanwhile, Le et al. utilized parameterized Gabor filtering modules to enhance the scale and orientation decomposition of images, thereby improving the generalization capability, detection accuracy, and inference speed of the detection model on underwater sonar images [26].
The second category of approaches explicitly considers sonar-image characteristics and proposes task-specific modules or architectures. For instance, Wang et al. developed a Multi-Level Feature Fusion Network (MLFFNet) that integrates multi-scale feature extraction and fusion with attention mechanisms and demonstrated its effectiveness on a dedicated sonar dataset [27]. Zhou et al. proposed a detector for forward-looking sonar images that incorporates global clustering and classical feature-mapping and discrimination techniques, achieving competitive performance on their dataset [28].
Another major challenge for sonar-image detection is the limited availability of labeled datasets, which can prevent adequate training and increase the risk of overfitting. Existing solutions can be broadly grouped into three directions. (i) Data augmentation generates additional training samples from existing data. For example, Phung et al. used generative adversarial networks to synthesize sonar-like images and incorporated a hierarchical Gaussian process classifier to improve recognition performance [29]. Huang et al. analyzed sonar-image formation mechanisms and designed augmentation strategies that better preserve sonar-specific appearance characteristics [30]. (ii) Few-shot and zero-shot learning aims to improve generalization when only a small number of labeled samples are available. Zhou et al. proposed a few-shot detector based on prototype relation embedding and contrastive learning [31], while Jiao et al. introduced a decoupled training framework with balanced ensemble transfer learning to alleviate long-tail effects in scarce-data settings [32]. (iii) Transfer learning leverages large-scale optical or infrared datasets for pre-training and then fine-tunes the detector on sonar data [33]. For instance, Tang et al. pre-trained YOLOv3 on COCO and fine-tuned it on real sonar datasets [34], and Zhang et al. adapted YOLOv5 for sonar images through architectural refinements and subsequent fine-tuning [35].
Beyond these earlier studies, recent work continues to emphasize lightweight and robust detection for sonar scenarios, including YOLO-based optimization and feature-enhancement designs for side-scan imagery [36,37]. These studies further confirm that balancing accuracy, robustness, and real-time efficiency remains a central research trend.
Despite these advances, most sonar-image detectors are still based on convolutional architectures. Although CNN-based methods can be efficient, they may struggle with small objects and degraded imagery because they primarily aggregate local features through sliding-window operations, which limits their ability to capture long-range context. In addition, noise and clutter can be amplified in intermediate feature maps, reducing robustness. Transformer-based models can model global interactions more effectively, but their computational cost often limits real-time deployment in sonar-image detection [38].
Motivated by these observations, we aim to develop a sonar-image detector that improves small-object detection and robustness to noise while remaining efficient for real-time deployment. Given the advantage of Transformers in modeling global context, we adopt RT-DETR as the baseline architecture [39]. RT-DETR is an end-to-end, real-time detector that combines DETR-style set prediction with multi-scale feature fusion and uses deformable attention and IoU-aware query selection to accelerate convergence and improve efficiency. These properties make it a suitable starting point for building a practical sonar-image detection algorithm.
However, local structural cues (e.g., edges and fine textures) remain important for accurate localization in sonar images. Therefore, we retain the overall RT-DETR detection framework and redesign the backbone by integrating convolutional and Transformer components. The proposed backbone contains two parallel streams: a Transformer stream for global context modeling and a CNN stream for local feature extraction. Features from the two streams are fused at intermediate and final stages to leverage complementary global and local information.
To further improve robustness, we design a Noise Filtering Module (NFM) that suppresses noise-related responses in intermediate feature maps. We also develop a transfer-learning pipeline tailored to the scarcity of sonar data. In particular, we analyze the effect of different pre-training sources and construct a noise-augmented dataset to train the NFM with a dedicated denoising stage before fine-tuning the full detector on sonar data. By integrating these components, we propose T2C-DETR (Transformer + Convolution Detection Transformer), a sonar-image detector that retains the efficient RT-DETR framework while introducing (i) a dual-channel Transformer–CNN backbone for complementary global and local feature extraction and (ii) an NFM-enhanced neck for noise suppression, together with a transfer-learning strategy to enable effective training with limited sonar annotations.
In summary, the main contributions of this paper are as follows:
We design a new backbone and a noise filtering module within the RT-DETR framework to address small-target detection, noise interference, and limited training data, and we validate the approach on a custom sonar dataset.
We propose a dual-channel backbone that integrates Transformer and convolutional modules and performs feature fusion at multiple stages to combine global context with local details.
We introduce a noise filtering module that suppresses noise in intermediate feature maps, thereby improving detection accuracy by emphasizing informative features.
We develop a transfer-learning strategy that analyzes different pre-training sources and includes a dedicated denoising stage for training the noise filtering module, enabling effective learning across diverse sonar tasks.
The remainder of this paper is organized as follows. Section 2 provides a brief overview of the background relevant to the proposed design. Section 3 details the architecture of the proposed T2C-DETR network, including the newly designed backbone network and the NFM. Section 4 presents extensive experimental results. Finally, Section 5 concludes the paper.
3. Method
3.1. T2C-DETR
The T2C-DETR proposed in this paper is built upon the RT-DETR framework. It preserves the core end-to-end detection pipeline and efficient multi-scale fusion design, while introducing task-oriented modifications for sonar imagery. Relative to the baseline, T2C-DETR incorporates three key improvements: (1) a Transformer–Convolution dual-channel backbone to jointly model global context and local structures, (2) a Noise Filtering Module (NFM) inserted into the neck to suppress noise-related activations in feature maps, and (3) a specialized training strategy that leverages transfer learning and stage-wise optimization to cope with limited sonar annotations.
The overall architecture of T2C-DETR is depicted in Figure 2. In the backbone, the input image first passes through a convolutional stem for low-level feature extraction. The extracted base feature maps are then fed into two parallel channels: a Swin Transformer channel for long-range dependency modeling and a convolutional channel for local structure extraction. Compared with a vanilla Transformer module, the Swin Transformer offers a more favorable accuracy–efficiency trade-off, thereby supporting real-time inference. At multiple stages, feature maps from the two channels are concatenated and fused to form hybrid representations, which are forwarded to the neck. The backbone ultimately outputs multi-scale feature maps from three stages for downstream detection.
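The two-channel design can be illustrated with a minimal PyTorch sketch of a single backbone stage. This is not the actual TCDCNet implementation: a plain multi-head self-attention layer stands in for the Swin Transformer blocks, and the channel widths, activation choices, and block counts are placeholders.

```python
import torch
import torch.nn as nn

class DualChannelStage(nn.Module):
    """Sketch of one Transformer + Convolution dual-channel stage.

    The real TCDCNet uses Swin Transformer blocks in the global channel;
    here a single multi-head self-attention layer stands in for them to
    keep the example self-contained.
    """
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Global channel: attention over flattened spatial tokens.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Local channel: plain 3x3 convolutions for edges and fine textures.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        # Concatenation-based fusion projected back to the stage width.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, HW, C)
        t = self.norm(tokens)
        t, _ = self.attn(t, t, t)
        g = t.transpose(1, 2).reshape(b, c, h, w)   # global-context features
        l = self.conv(x)                            # local-detail features
        return self.fuse(torch.cat([g, l], dim=1))  # concat fusion

x = torch.randn(2, 64, 32, 32)
y = DualChannelStage(64)(x)
```

In the full backbone, three such stages would emit the multi-scale maps S1–S3 consumed by the neck.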
The neck of T2C-DETR follows the RT-DETR design and incorporates additional noise suppression. Specifically, the three backbone feature maps are first processed by three Noise Filtering Modules (NFMs) to attenuate noise-related responses. The encoder then applies self-attention to the NFM-enhanced S3 feature map, which carries richer semantics, while S1 and S2 bypass the encoder to reduce computation. Next, S1 and S2 are fused with the encoded S3 through a Path Aggregation Network (PAN) to aggregate multi-scale information. Finally, the fused feature maps are unfolded along the channel dimension and concatenated to construct the memory input for the IoU-aware query selection module.
The decision to apply self-attention only to the S3 feature map follows RT-DETR’s observation that self-attention on the highest-level feature map, followed by PAN-based fusion, can achieve better accuracy than direct cross-scale attention while substantially reducing computation. Moreover, avoiding self-attention on multiple large feature maps is beneficial for real-time inference. The IoU-aware query selection module maps the neck memory to token scores and selects the top-k tokens to initialize object queries for the decoder. Specifically, the top 300 tokens ranked by classification confidence are used as content queries. An auxiliary bounding-box predictor then estimates preliminary boxes, which are encoded as positional queries. Content queries and positional queries are combined to form the final initialized object queries. This initialization provides the decoder with higher-quality starting points, improves query-to-memory interaction, reduces optimization difficulty, and accelerates training convergence.
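The top-k initialization described above can be sketched as follows. The helper names (`cls_head`, `box_head`) and tensor shapes are illustrative assumptions, not RT-DETR's actual code; only the overall flow (score tokens, keep the top 300, predict preliminary boxes) follows the text.

```python
import torch
import torch.nn as nn

def select_initial_queries(memory: torch.Tensor,
                           cls_head: nn.Linear,
                           box_head: nn.Linear,
                           k: int = 300):
    """Sketch of IoU-aware top-k query initialization.

    memory: (B, N, C) flattened multi-scale features from the neck.
    Returns content queries (top-k tokens) and preliminary boxes that
    would be encoded as positional queries.
    """
    scores = cls_head(memory).max(dim=-1).values   # (B, N) best class score per token
    topk = scores.topk(k, dim=1).indices           # (B, k) indices of best tokens
    idx = topk.unsqueeze(-1).expand(-1, -1, memory.size(-1))
    content = memory.gather(1, idx)                # (B, k, C) content queries
    boxes = box_head(content).sigmoid()            # (B, k, 4) preliminary boxes
    return content, boxes

memory = torch.randn(2, 1200, 256)
content, boxes = select_initial_queries(memory, nn.Linear(256, 3), nn.Linear(256, 4))
```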
In each decoder layer, object queries first perform self-attention to capture interactions among candidate objects and to model their spatial relationships. The refined queries then attend to the neck memory through cross-attention to retrieve features relevant to each object. Following RT-DETR, deformable attention is adopted to accelerate both training and inference by restricting attention to a sparse set of sampling locations. Through cross-attention, each query progressively aggregates informative features from the memory and becomes more discriminative. Each decoder layer is equipped with an auxiliary detection head to produce intermediate predictions, which provides additional supervision and stabilizes training.
The predictions are matched against the ground truth (GT) using bipartite graph matching to compute the loss. The loss calculation is summarized as follows:

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\big(y_i, \hat{y}_{\sigma(i)}\big),$$

where $\hat{\sigma}$ is the optimal bipartite matching, $\hat{y}_i$ and $y_i$ denote predictions and ground truth, and $\mathcal{L}_{\text{match}}$ combines classification, box regression, and IoU losses.
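The optimal matching itself is typically solved with the Hungarian algorithm. A toy sketch using SciPy's `linear_sum_assignment`, with a hypothetical cost matrix standing in for the full classification + box + IoU matching cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost: np.ndarray):
    """Optimal bipartite matching between predictions and GT objects.

    cost[i, j] is an illustrative matching cost between prediction i and
    ground-truth object j; lower is better.
    """
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))

# Toy 3-prediction / 2-GT example.
cost = np.array([[0.1, 0.9],
                 [0.8, 0.2],
                 [0.5, 0.6]])
pairs = hungarian_match(cost)
print(pairs)  # [(0, 0), (1, 1)] -- the third prediction stays unmatched ("no object")
```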
Moreover, we develop a transfer-learning training scheme tailored to T2C-DETR. First, the full network is pre-trained on a large-scale visible-light or infrared dataset to learn general-purpose representations. Next, we sample images from both the pre-training dataset and our proprietary sonar dataset and apply random noise to construct a small denoising set. During this stage, all modules except the NFM are frozen, and the NFM is trained separately to enhance noise suppression. Finally, the backbone and the NFM are frozen, and the remaining modules are fine-tuned on our small sonar dataset to adapt the detector to acoustic imagery. After training, the number of decoders can be adjusted to trade accuracy for speed and model size without requiring retraining, which is convenient for deployment under different resource constraints.
3.2. Transformer + Convolution Dual-Channel Backbone Network
Transformer blocks are effective at capturing global context via self-attention, which enables long-range dependency modeling and improves robustness when target appearance is degraded, cluttered, or partially ambiguous. In contrast, convolutional blocks extract features with local receptive fields and are well-suited for representing fine-grained structures such as edges, contours, and local textures. From a computational perspective, global attention usually scales quadratically with the number of tokens, whereas convolution scales approximately linearly with image resolution. For sonar imagery, both characteristics are essential: global context supports target localization under strong noise and low contrast and reduces background confusion, while local cues help delineate blurred boundaries and suppress false alarms induced by speckle-like interference. Therefore, integrating Transformer-based global modeling with convolutional local representation is a natural and effective choice for improving detection stability across diverse sonar conditions.
Based on the aforementioned ideas, this paper proposes a novel Transformer + Convolution dual-channel backbone network, termed TCDCNet, as illustrated in Figure 3. We employ concatenation-based fusion at multiple stages (rather than element-wise addition or attention-based fusion) to preserve the full diversity of features from both pathways without information loss. Addition risks erasing discriminative signals when features have different scales, while attention-based fusion introduces additional parameters and computational overhead that could compromise real-time requirements. The concatenation strategy maintains complementary global-context and local-detail representations while remaining computationally efficient for embedded deployment. The overall architecture of TCDCNet is as follows:
3.3. NFM Module
When extracting features from noisy sonar images, noise-related patterns are inevitably propagated into intermediate feature maps together with target cues. If such responses are not explicitly suppressed, they may be amplified by subsequent fusion operations and eventually mislead the detection heads, resulting in degraded localization and increased false alarms. To mitigate this issue, we propose a Noise Filtering Module (NFM) and place it at the front of the neck to denoise the multi-scale features produced by the backbone. This location is chosen because it suppresses noise before encoder/decoder feature interaction and multi-scale fusion, thereby reducing the propagation and amplification of noise responses in downstream modules while preserving the original backbone design. The overall structure of the NFM is shown in Figure 4.
The module comprises two parallel branches: (1) an upper branch with a standard convolution and activation to preserve complementary local responses; and (2) a lower branch that combines depthwise separable convolution, a nonlinear activation, and a squeeze-and-excitation (SE) unit to capture lightweight channel-wise dependencies.
The input feature maps are processed by the two branches in parallel. Their outputs are concatenated and then fed into a convolutional block attention module (CBAM) to adaptively reweight both channel and spatial responses, followed by a convolution for channel projection. With this design, the NFM selectively attenuates noise-dominated activations while retaining discriminative target features, thereby reducing interference to the downstream detector and improving overall detection accuracy.
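A hedged PyTorch sketch of this two-branch structure is given below. The kernel sizes, reduction ratios, and the reduced CBAM (an SE-style channel gate plus a spatial gate) are illustrative assumptions; the paper's exact configuration may differ.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-excitation: lightweight channel reweighting."""
    def __init__(self, c: int, r: int = 4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.SiLU(),
                                nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))     # squeeze over H, W
        return x * w[:, :, None, None]      # excite per channel

class SimplifiedCBAM(nn.Module):
    """Channel + spatial attention (reduced CBAM for brevity)."""
    def __init__(self, c: int, r: int = 4):
        super().__init__()
        self.channel = SE(c, r)             # stand-in for CBAM's channel gate
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        x = self.channel(x)
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class NFM(nn.Module):
    """Noise Filtering Module sketch: two parallel branches -> concat -> CBAM -> 1x1 proj."""
    def __init__(self, c: int):
        super().__init__()
        self.upper = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.SiLU())
        self.lower = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, groups=c),  # depthwise
            nn.Conv2d(c, c, 1),                       # pointwise
            nn.SiLU(),
            SE(c),
        )
        self.cbam = SimplifiedCBAM(2 * c)
        self.proj = nn.Conv2d(2 * c, c, 1)

    def forward(self, x):
        y = torch.cat([self.upper(x), self.lower(x)], dim=1)
        return self.proj(self.cbam(y))

x = torch.randn(2, 64, 20, 20)
out = NFM(64)(x)
```

Because the output has the same shape as the input, the module can be dropped in front of each of the three neck inputs without changing the rest of the pipeline.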
3.4. Specialized Training Strategy
Considering the scarcity of annotated underwater sonar datasets for detector training, we design a transfer-learning-based training strategy that is tailored to the architecture and modules of the proposed detector. The overall procedure is shown in Figure 5, which summarizes the three-stage optimization workflow (pre-training, NFM-only denoising adaptation, and sonar-domain fine-tuning) and clarifies which modules are frozen or trainable at each stage. First, the entire network is pre-trained on a large-scale general-purpose dataset so that the backbone, encoder, and decoder can learn robust feature extraction and representation refinement capabilities. Next, a few hundred images are sampled from the pre-training dataset, and random noise is injected to construct a lightweight denoising set. During this stage, all parameters except those of the three NFM modules are frozen, and the NFMs are fine-tuned to explicitly learn noise suppression in intermediate feature maps. We use the same detection objective as the main training stage (classification + box regression + IoU losses), so gradients are propagated only through NFM parameters while all other modules remain frozen; no additional standalone denoising loss is introduced. Finally, the backbone, all NFMs, the encoder, and the decoder are frozen, and only the IoU-aware query selection module, the auxiliary box predictor for query initialization, and the decoder-specific detection heads are trained on our self-built small-scale sonar dataset. This stage-wise optimization aligns the task-specific mapping with the acoustic imaging characteristics, yielding a detector that is better adapted to the target sonar domain.
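The stage-wise freezing schedule can be sketched by toggling `requires_grad` per parameter group. The submodule names below (`backbone`, `nfms`, `query_select`, `aux_box_head`, `decoder.heads`) are hypothetical stand-ins for the actual module names, and `TinyDetector` is only a minimal test double.

```python
import torch.nn as nn

def set_stage(model: nn.Module, stage: str) -> None:
    """Toggle trainable parameters for the three-stage schedule (illustrative)."""
    groups = {
        # Stage 1: pre-train everything on a large general-purpose dataset.
        "pretrain": lambda n: True,
        # Stage 2: denoising adaptation -- only the NFMs learn.
        "nfm_only": lambda n: n.startswith("nfms"),
        # Stage 3: sonar fine-tuning -- only task-specific heads learn.
        "finetune": lambda n: n.startswith(("query_select", "aux_box_head", "decoder.heads")),
    }
    trainable = groups[stage]
    for name, p in model.named_parameters():
        p.requires_grad = trainable(name)

class TinyDetector(nn.Module):
    """Minimal stand-in exposing the assumed submodule names."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.nfms = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])
        self.query_select = nn.Linear(8, 8)

model = TinyDetector()
set_stage(model, "nfm_only")   # stage 2: only the three NFMs remain trainable
```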
It is worth noting that optical images typically contain richer textures and more complete geometric details than sonar imagery, which may help pre-training learn strong generic representations that transfer to sonar detection. In contrast, infrared images are often closer to sonar in terms of low contrast, limited texture, and coarse structural patterns. Therefore, we conduct separate pre-training using both optical and infrared datasets and report the corresponding results in Table 1, Table 2 and Table 3. For the NFM denoising stage, we apply diverse stochastic degradations to the sampled images, such as random patch corruption, detail blurring, and region-wise tone perturbation, to emulate typical sonar artifacts and improve the NFM's generalization to unseen noise patterns.
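A minimal NumPy sketch of the three degradation types follows. The magnitudes (patch size, blur kernel, gain range) are illustrative assumptions, since the paper does not specify exact parameters.

```python
import numpy as np

def sonar_noise_augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply the three stochastic degradations described above (illustrative).

    img: grayscale image as a float array in [0, 1], shape (H, W).
    """
    out = img.copy()
    h, w = out.shape

    # 1. Random patch corruption: overwrite a small region with speckle noise.
    ph, pw = h // 8, w // 8
    y, x = rng.integers(0, h - ph), rng.integers(0, w - pw)
    out[y:y + ph, x:x + pw] = rng.random((ph, pw))

    # 2. Detail blurring: simple 3x3 box blur to smear fine textures.
    padded = np.pad(out, 1, mode="edge")
    out = sum(padded[dy:dy + h, dx:dx + w]
              for dy in range(3) for dx in range(3)) / 9.0

    # 3. Region-wise tone perturbation: rescale brightness of one half-plane.
    gain = rng.uniform(0.7, 1.3)
    out[:, : w // 2] *= gain
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
noisy = sonar_noise_augment(np.full((64, 64), 0.5), rng)
```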
4. Experiment
4.1. Configuration
To identify which pre-training data source transfers best to underwater sonar imagery, we conducted three parallel experiments in which the original T2C-DETR was pre-trained on the COCO 2017 dataset, the DOTA small-object detection dataset, and an infrared dataset, respectively. For the standalone training of the NFM, we sampled a subset of images from the corresponding pre-training dataset, injected random noise to build a lightweight denoising set, and used it to fine-tune the NFM. Representative examples of the noise-augmented data are shown in Figure 7. Unless otherwise specified, we used the standard COCO AP protocol for evaluation, and all input images were resized to the same fixed resolution.
The infrared pre-training data are from the public FLIR thermal dataset [52]. We use only the thermal channel for pre-training and retain object categories shared with our detection setting (e.g., vehicles and persons) as generic foreground targets. This dataset provides lower-contrast and texture-sparse imagery compared with visible-light datasets, which is beneficial for transferring representations to sonar scenes with weak texture and blurred boundaries.
In the comparative experiments, we evaluated the proposed T2C-DETR against an improved YOLOv5 [35], MLFFNet [27], and the baseline Transformer detector [39]. For YOLOv5 and MLFFNet, after obtaining the pre-trained models, we froze all layers except the detection heads and fine-tuned only the heads using our self-built small-scale sonar dataset. For the baseline, after pre-training, we fine-tuned the IoU-aware query selection module, the auxiliary box predictor used for query initialization, and the decoder-specific detection heads on the same sonar dataset.
Although newer YOLO versions are available, we selected YOLOv5 as the main one-stage comparator for two reasons. First, improved YOLOv5 variants have been explicitly validated for sonar imagery [35], making it a representative and reproducible baseline in this domain. Second, YOLOv5 remains widely adopted in embedded and real-time deployment settings, and therefore provides a practical reference for evaluating the accuracy–speed trade-off of T2C-DETR under comparable engineering constraints.
For the baseline and T2C-DETR, the pre-training stage lasted 72 epochs, and T2C-DETR was further trained for 36 epochs in the NFM denoising stage. The improved YOLOv5 adopted the L model and, together with MLFFNet, was pre-trained for 300 epochs. Fine-tuning on the small-scale sonar dataset was performed for 100 epochs for both methods. All experiments were conducted on two RTX 3080 Ti GPUs under Ubuntu 20.04.
4.2. Execution Details
During training, both T2C-DETR and the baseline employed the IoU-aware module to select the top 300 tokens for initializing object queries in the decoder. Unless otherwise stated, the training schedule, hyperparameters, and denoising-related settings followed the baseline configuration. We optimized all detectors using AdamW with base_learning_rate = 0.0001, weight_decay = 0.0001, global_gradient_clip_norm = 0.0001, and linear_warmup_steps = 2000. Exponential moving average (EMA) was also applied with EMA_decay_rate = 0.999.
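The optimizer and EMA settings above can be reproduced as follows. The single-layer stand-in model and the plain-Python EMA class are illustrative, not the actual training code; only the hyperparameter values come from the text.

```python
import torch
import torch.nn as nn

# Stand-in model; optimizer hyperparameters follow the values reported above.
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

class EMA:
    """Exponential moving average of parameters (decay = 0.999, as above)."""
    def __init__(self, model: nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {n: p.detach().clone() for n, p in model.named_parameters()}

    def update(self, model: nn.Module) -> None:
        for n, p in model.named_parameters():
            # shadow <- decay * shadow + (1 - decay) * param
            self.shadow[n].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)

ema = EMA(model)
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
# Global gradient clipping with the norm reported above.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.0001)
optimizer.step()
ema.update(model)
```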
For YOLOv5 and MLFFNet, we followed the training protocols reported in their original papers. In addition to the dedicated NFM denoising stage, we adopted standard data augmentation operations, including random color distortion, image expansion, random cropping, horizontal flipping, and multi-scale resizing.
Given the dominance of small objects in underwater sonar datasets, we report mAP at an IoU threshold of 0.5 (mAP@0.5) as the primary accuracy metric for all detectors. We additionally report inference speed in frames per second (FPS) to characterize real-time performance. The mAP metric jointly reflects false positives and false negatives through precision ($P$) and recall ($R$), which are computed as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},$$

where TP is the number of correctly predicted positives, FP is the number of false positives, and FN is the number of false negatives. For each category, $P$ and $R$ are computed at IoU = 0.5 to obtain a set of precision–recall (P–R) curves. The area under the P–R curve corresponds to the Average Precision (AP):

$$AP = \sum_{k} P(k)\,\Delta R(k),$$

where $P(k)$ and $R(k)$ denote the precision and recall values at the $k$-th operating point. The final mAP is obtained by averaging the AP values over all categories:

$$mAP = \frac{1}{C} \sum_{i=1}^{C} AP_i,$$

where $C$ denotes the number of categories and $AP_i$ is the AP of the $i$-th category.
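These definitions can be checked with a small pure-Python example using toy precision–recall operating points (the points below are invented for illustration):

```python
def average_precision(points):
    """Riemann-sum AP: area under a P-R curve given as (recall, precision)
    operating points sorted by increasing recall."""
    ap, prev_r = 0.0, 0.0
    for r, p in points:
        ap += p * (r - prev_r)   # P(k) * delta-R(k)
        prev_r = r
    return ap

def mean_ap(per_class_points):
    """mAP = average of per-category AP values."""
    aps = [average_precision(pts) for pts in per_class_points]
    return sum(aps) / len(aps)

# Toy two-category example.
cat1 = [(0.5, 1.0), (1.0, 0.5)]   # AP = 1.0*0.5 + 0.5*0.5 = 0.75
cat2 = [(1.0, 1.0)]               # AP = 1.0
print(mean_ap([cat1, cat2]))      # 0.875
```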
The custom underwater sonar dataset used in this paper comprises 5000 images with three object categories: mines, sunken ships, and crashed airplanes. The counts 3220 (mine), 2860 (crashed airplane), and 2230 (sunken ship) refer to annotated object instances rather than image counts; multiple objects may appear in one image. Small objects dominate this dataset, accounting for 92.5% of all instances. Figure 6 shows representative clean sonar examples, while Figure 7 presents representative noise-augmented samples used in the NFM adaptation stage. The pre-training datasets include COCO 2017, DOTA, and an infrared image dataset.
4.3. Result Analysis
Table 1, Table 2 and Table 3 summarize the quantitative comparisons between the proposed T2C-DETR and competing detectors, where all models are pre-trained on different source datasets and then fine-tuned on our custom sonar dataset. In the three parallel settings, T2C-DETR achieves AP values of 97.8%, 98.2%, and 98.5%, with corresponding inference speeds of 72, 73, and 72 FPS when pre-trained on COCO 2017, DOTA, and the infrared dataset, respectively. Overall, T2C-DETR consistently yields a favorable accuracy–speed trade-off and outperforms detectors of comparable scale. Specifically, relative to the baseline, T2C-DETR improves AP by 0.7%, 1.2%, and 0.9% with similar real-time throughput. Compared with YOLOv5-Imp, T2C-DETR achieves 1.3%, 1.3%, and 1.2% higher AP while being faster at inference. Compared with MLFFNet, T2C-DETR yields AP gains of 1.4%, 1.0%, and 1.1%, together with higher FPS.
Notably, pre-training on the infrared dataset produces the best overall performance. A plausible explanation is that this infrared dataset contains a higher proportion of small objects than COCO 2017 and exhibits a visual style closer to sonar imagery than DOTA (e.g., lower contrast and less texture). In addition, infrared images often include blurred or low-detail targets, which resemble the acoustic imaging characteristics of sonar. These factors make infrared pre-training particularly effective for transferring to sonar object detection.
4.4. Ablation Experiment
To validate the effectiveness of the proposed Transformer + Convolution dual-channel backbone network (TCDCNet) and the Noise Filtering Module (NFM), we conducted ablation experiments to assess the impact of these components on the final results. We compared the performance of baseline algorithms with different improvement methods while maintaining consistent training parameters. The comparative results are presented in Table 4, Table 5 and Table 6, where ✓ indicates the module is enabled.
Based on the results in Table 4, Table 5 and Table 6, both TCDCNet and the NFM significantly enhance the detector's performance. In these experiments, TCDCNet improved the AP by 0.5%, 0.9%, and 0.7%, respectively, compared to the baseline. The NFM also contributed performance gains of 0.2%, 0.6%, and 0.4% AP over the baseline. When both improvements were combined, the gains were larger still.
These findings strongly support the effectiveness of the proposed TCDCNet and NFM in improving the overall performance of the detector.
4.5. Statistical Analysis and Reproducibility
For reproducibility, we specify the infrared pre-training source (the FLIR thermal dataset), the objective used in the NFM-only stage (the same detection loss, with only NFM parameters updated), and the distinction between image-level and instance-level statistics for the custom sonar dataset.
To quantify stability across source domains, we summarize the three parallel pre-training settings (COCO, DOTA, and infrared) in Table 7. Averaged over the three settings, T2C-DETR achieves an AP@0.5 of 98.2% with a standard deviation of about 0.3%, compared with mean APs of 97.2% for the baseline, 96.9% for YOLOv5-Imp, and 97.0% for MLFFNet. In terms of runtime, T2C-DETR averages 72.3 FPS, comparable to the baseline and higher than YOLOv5-Imp and MLFFNet. These statistics indicate that T2C-DETR provides both higher average accuracy and more stable cross-source performance than competing methods.
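The cross-source statistics for T2C-DETR can be recomputed directly from the per-domain results reported in Section 4.3 (AP of 97.8%, 98.2%, and 98.5% at 72, 73, and 72 FPS); the population standard deviation over the three settings is used here:

```python
import statistics

# AP@0.5 (%) and FPS of T2C-DETR under COCO, DOTA, and infrared pre-training.
ap = [97.8, 98.2, 98.5]
fps = [72, 73, 72]

mean_ap = statistics.fmean(ap)
std_ap = statistics.pstdev(ap)   # population std over the three settings
mean_fps = statistics.fmean(fps)
print(round(mean_ap, 2), round(std_ap, 2), round(mean_fps, 1))
```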
From the ablation results (Table 4, Table 5 and Table 6), the TCDCNet-only variant improves AP@0.5 over the baseline by 0.5–0.9% (0.7% on average), the NFM-only variant by 0.2–0.6% (0.4% on average), and the full model by a larger margin than either module alone. This decomposition shows that both modules contribute consistently, with TCDCNet accounting for the larger share of gains and NFM providing additional improvements under noisy sonar conditions.
For deployment-oriented comparison, we additionally report the accuracy–speed product (AP@0.5×FPS) as a practical composite indicator. Averaged over the three source domains, T2C-DETR achieves 7099, compared with 6904 (baseline), 6298 (YOLOv5-Imp), and 6111 (MLFFNet), further supporting its favorable real-time trade-off. T2C-DETR incurs only marginal overhead, increasing from 42M to 45M parameters and from 136 GFLOPs to 142 GFLOPs, while maintaining real-time speeds comparable to the baseline model.
5. Conclusions
This paper presents T2C-DETR, a sonar-image object detector designed to address key challenges in practical underwater perception, including small-object detection, strong noise interference, and data scarcity. Built upon the RT-DETR framework, T2C-DETR preserves the efficient end-to-end pipeline to maintain a lightweight architecture and real-time inference capability. We further propose a Transformer–Convolution dual-channel backbone to jointly capture long-range context and fine-grained local structures, where multi-stage cross-fusion enables complementary global–local representations for improved detection accuracy. In addition, we integrate an NFM into the neck to suppress noise-dominated activations in multi-scale feature maps, thereby enhancing the utilization of target-relevant information.
To alleviate the small-sample limitation of sonar datasets, we develop a stage-wise transfer learning strategy. Specifically, we first pre-train the full network on large-scale visible and infrared datasets to learn general-purpose representations. We then construct a compact denoising set by injecting random degradations into samples from the pre-training and sonar datasets and fine-tune only the NFM modules with the remaining components frozen. Finally, we freeze the backbone and NFM modules and fine-tune the task-specific modules on our custom small-scale sonar dataset to obtain the final detector.
Extensive experiments demonstrate that T2C-DETR effectively mitigates common difficulties in sonar image analysis, including small-target detection under noise and learning with limited annotations. The proposed design exhibits strong robustness and adaptability across multiple sonar datasets, indicating its potential for practical underwater sonar object detection applications.
Quantitatively, under three pre-training settings (COCO 2017, DOTA, and infrared), T2C-DETR reaches AP values of 97.8%, 98.2%, and 98.5% at real-time speeds of 72–73 FPS. Relative to the RT-DETR baseline, AP is improved by 0.7–1.2%; relative to YOLOv5-Imp, AP gains are 1.2–1.3% with higher FPS. These results directly support the conclusion that the proposed dual-channel backbone, NFM, and transfer-learning strategy jointly improve both detection accuracy and deployment efficiency.