1. Introduction
Driven by the growing demand for sustainable energy, offshore renewable-energy projects are expanding rapidly. Their deployment and maintenance require reliable monitoring of the marine environment and, in particular, accurate underwater object detection for tasks such as subsea cable installation and the inspection of underwater foundations.
Deep learning has become a dominant paradigm for object detection and now underpins a wide range of applications, including surveillance [1,2], medical image analysis [3], autonomous driving, and agricultural monitoring [4,5]. In parallel, increasing ocean exploitation and exploration [6,7] have created a strong demand for accurate and timely underwater object detection and identification. These capabilities support safety-critical operations such as seabed resource development and underwater archaeological surveys [8,9], as well as search-and-recovery missions for sunken vessels, aircraft accidents, and missing persons [10,11].
Because acoustic waves propagate more effectively in water than electromagnetic and optical waves, sonar is widely used for underwater target detection and provides critical prior information for many underwater tasks [12,13,14]. Nevertheless, sonar-based detection is challenging in real environments. Sediment coverage, biological occlusions, and water flow can partially obscure targets; diverse noise sources distort echoes; and variations in temperature, salinity, and depth further degrade signal quality. Consequently, sonar images often contain many small objects with blurred boundaries and weak textures, making them easy to confuse with background clutter [15,16,17]. Moreover, collecting and annotating real sonar data is difficult and expensive, and publicly available datasets remain limited [12,18]. These factors, together with the domain gap between acoustic and optical imagery [19], hinder the direct transfer of mainstream deep-learning detectors and often lead to poor generalization across sonar sensors and operating conditions.
Many underwater applications impose strict real-time constraints, which motivates detectors with compact models and fast inference [20]. Such detectors can be deployed on resource-limited platforms (e.g., small vessels or embedded systems) to process sonar data on site and avoid latency caused by transmitting data to onshore facilities. In addition, rapid and reliable target localization and identification are essential for timely decision-making in operational scenarios [12,21]. Therefore, it is practically important to develop sonar-image detectors that achieve high accuracy for small objects while remaining robust to noise and efficient enough for real-time deployment across diverse sonar modalities.
Existing underwater sonar-image object detectors can be broadly categorized into two types. The first type is derived from object detectors originally designed for visible-light images [22]. Given the relative maturity of optical-image object detection, leveraging the extensive experience and technological advances from that domain can significantly accelerate the development of high-performance detectors for acoustic imagery [23,24]. For instance, Fan et al. improved the YOLOv4 network by streamlining its backbone to reduce model parameters and network depth, thereby meeting real-time requirements, and enhanced the PAN module to improve the detection accuracy of small targets [25]. Meanwhile, Le et al. utilized parameterized Gabor filtering modules to enhance the scale and orientation decomposition of images, thereby improving the generalization capability, detection accuracy, and inference speed of the detection model on underwater sonar images [26].
The second category of approaches explicitly considers sonar-image characteristics and proposes task-specific modules or architectures. For instance, Wang et al. developed a Multi-Level Feature Fusion Network (MLFFNet) that integrates multi-scale feature extraction and fusion with attention mechanisms and demonstrated its effectiveness on a dedicated sonar dataset [27]. Zhou et al. proposed a detector for forward-looking sonar images that incorporates global clustering and classical feature-mapping and discrimination techniques, achieving competitive performance on their dataset [28].
Another major challenge for sonar-image detection is the limited availability of labeled datasets, which can prevent adequate training and increase the risk of overfitting. Existing solutions can be broadly grouped into three directions. (i) Data augmentation generates additional training samples from existing data. For example, Phung et al. used generative adversarial networks to synthesize sonar-like images and incorporated a hierarchical Gaussian process classifier to improve recognition performance [29]. Huang et al. analyzed sonar-image formation mechanisms and designed augmentation strategies that better preserve sonar-specific appearance characteristics [30]. (ii) Few-shot and zero-shot learning aims to improve generalization when only a small number of labeled samples are available. Zhou et al. proposed a few-shot detector based on prototype relation embedding and contrastive learning [31], while Jiao et al. introduced a decoupled training framework with balanced ensemble transfer learning to alleviate long-tail effects in scarce-data settings [32]. (iii) Transfer learning leverages large-scale optical or infrared datasets for pre-training and then fine-tunes the detector on sonar data [33]. For instance, Tang et al. pre-trained YOLOv3 on COCO and fine-tuned it on real sonar datasets [34], and Zhang et al. adapted YOLOv5 for sonar images through architectural refinements and subsequent fine-tuning [35].
Beyond these earlier studies, recent work continues to emphasize lightweight and robust detection for sonar scenarios, including YOLO-based optimization and feature-enhancement designs for side-scan imagery [36,37]. These studies further confirm that balancing accuracy, robustness, and real-time efficiency remains a central research trend.
Despite these advances, most sonar-image detectors are still based on convolutional architectures. Although CNN-based methods can be efficient, they may struggle with small objects and degraded imagery because they primarily aggregate local features through sliding-window operations, which limits their ability to capture long-range context. In addition, noise and clutter can be amplified in intermediate feature maps, reducing robustness. Transformer-based models can model global interactions more effectively, but their computational cost often limits real-time deployment in sonar-image detection [38].
Motivated by these observations, we aim to develop a sonar-image detector that improves small-object detection and robustness to noise while remaining efficient for real-time deployment. Given the advantage of Transformers in modeling global context, we adopt RT-DETR as the baseline architecture [39]. RT-DETR is an end-to-end, real-time detector that combines DETR-style set prediction with multi-scale feature fusion and uses deformable attention and IoU-aware query selection to accelerate convergence and improve efficiency. These properties make it a suitable starting point for building a practical sonar-image detection algorithm.
However, local structural cues (e.g., edges and fine textures) remain important for accurate localization in sonar images. Therefore, we retain the overall RT-DETR detection framework and redesign the backbone by integrating convolutional and Transformer components. The proposed backbone contains two parallel streams: a Transformer stream for global context modeling and a CNN stream for local feature extraction. Features from the two streams are fused at intermediate and final stages to leverage complementary global and local information.
To further improve robustness, we design a Noise Filtering Module (NFM) that suppresses noise-related responses in intermediate feature maps. We also develop a transfer-learning pipeline tailored to the scarcity of sonar data. In particular, we analyze the effect of different pre-training sources and construct a noise-augmented dataset to train the NFM with a dedicated denoising stage before fine-tuning the full detector on sonar data. By integrating these components, we propose T2C-DETR (Transformer + Convolution Detection Transformer), a sonar-image detector that retains the efficient RT-DETR framework while introducing (i) a dual-channel Transformer–CNN backbone for complementary global and local feature extraction and (ii) an NFM-enhanced neck for noise suppression, together with a transfer-learning strategy to enable effective training with limited sonar annotations.
In summary, the main contributions of this paper are as follows:
We design a new backbone and a noise filtering module within the RT-DETR framework to address small-target detection, noise interference, and limited training data, and we validate the approach on a custom sonar dataset.
We propose a dual-channel backbone that integrates Transformer and convolutional modules and performs feature fusion at multiple stages to combine global context with local details.
We introduce a noise filtering module that suppresses noise in intermediate feature maps, thereby improving detection accuracy by emphasizing informative features.
We develop a transfer-learning strategy that analyzes different pre-training sources and includes a dedicated denoising stage for training the noise filtering module, enabling effective learning across diverse sonar tasks.
The remainder of this paper is organized as follows. Section 2 provides a brief overview of the background relevant to the proposed design. Section 3 details the architecture of the proposed T2C-DETR network, including the newly designed backbone network and the NFM. Section 4 presents extensive experimental results. Finally, Section 5 concludes the paper.
3. Method
3.1. T2C-DETR
The T2C-DETR proposed in this paper is built upon the RT-DETR framework. It preserves the core end-to-end detection pipeline and efficient multi-scale fusion design, while introducing task-oriented modifications for sonar imagery. Relative to the baseline, T2C-DETR incorporates three key improvements: (1) a Transformer–Convolution dual-channel backbone to jointly model global context and local structures, (2) a Noise Filtering Module (NFM) inserted into the neck to suppress noise-related activations in feature maps, and (3) a specialized training strategy that leverages transfer learning and stage-wise optimization to cope with limited sonar annotations.
The overall architecture of T2C-DETR is depicted in Figure 2. In the backbone, the input image first passes through a convolutional stem for low-level feature extraction. The extracted base feature maps are then fed into two parallel channels: a Swin Transformer channel for long-range dependency modeling and a convolutional channel for local structure extraction. Compared with a vanilla Transformer module, the Swin Transformer offers a more favorable accuracy–efficiency trade-off, thereby supporting real-time inference. At multiple stages, feature maps from the two channels are concatenated and fused to form hybrid representations, which are forwarded to the neck. The backbone ultimately outputs multi-scale feature maps from three stages for downstream detection.
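The two-channel design can be illustrated with a minimal PyTorch sketch of a single backbone stage. This is not the actual TCDCNet implementation: a plain multi-head self-attention layer stands in for the Swin Transformer blocks, and the channel widths, activation choices, and block counts are placeholders.

```python
import torch
import torch.nn as nn

class DualChannelStage(nn.Module):
    """Sketch of one Transformer + Convolution dual-channel stage.

    The real TCDCNet uses Swin Transformer blocks in the global channel;
    here a single multi-head self-attention layer stands in for them to
    keep the example self-contained.
    """
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Global channel: attention over flattened spatial tokens.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Local channel: plain 3x3 convolutions for edges and fine textures.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        # Concatenation-based fusion projected back to the stage width.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, HW, C)
        t = self.norm(tokens)
        t, _ = self.attn(t, t, t)
        g = t.transpose(1, 2).reshape(b, c, h, w)   # global-context features
        l = self.conv(x)                            # local-detail features
        return self.fuse(torch.cat([g, l], dim=1))  # concat fusion

x = torch.randn(2, 64, 32, 32)
y = DualChannelStage(64)(x)
```

In the full backbone, three such stages would emit the multi-scale maps S1–S3 consumed by the neck.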
The neck of T2C-DETR follows the RT-DETR design and incorporates additional noise suppression. Specifically, the three backbone feature maps are first processed by three Noise Filtering Modules (NFMs) to attenuate noise-related responses. The encoder then applies self-attention to the NFM-enhanced S3 feature map, which carries richer semantics, while S1 and S2 bypass the encoder to reduce computation. Next, S1 and S2 are fused with the encoded S3 through a Path Aggregation Network (PAN) to aggregate multi-scale information. Finally, the fused feature maps are unfolded along the channel dimension and concatenated to construct the memory input for the IoU-aware query selection module.
The decision to apply self-attention only to the S3 feature map follows RT-DETR’s observation that self-attention on the highest-level feature map, followed by PAN-based fusion, can achieve better accuracy than direct cross-scale attention while substantially reducing computation. Moreover, avoiding self-attention on multiple large feature maps is beneficial for real-time inference. The IoU-aware query selection module maps the neck memory to token scores and selects the top-k tokens to initialize object queries for the decoder. Specifically, the top 300 tokens ranked by classification confidence are used as content queries. An auxiliary bounding-box predictor then estimates preliminary boxes, which are encoded as positional queries. Content queries and positional queries are combined to form the final initialized object queries. This initialization provides the decoder with higher-quality starting points, improves query-to-memory interaction, reduces optimization difficulty, and accelerates training convergence.
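The top-k initialization described above can be sketched as follows. The helper names (`cls_head`, `box_head`) and tensor shapes are illustrative assumptions, not RT-DETR's actual code; only the overall flow (score tokens, keep the top 300, predict preliminary boxes) follows the text.

```python
import torch
import torch.nn as nn

def select_initial_queries(memory: torch.Tensor,
                           cls_head: nn.Linear,
                           box_head: nn.Linear,
                           k: int = 300):
    """Sketch of IoU-aware top-k query initialization.

    memory: (B, N, C) flattened multi-scale features from the neck.
    Returns content queries (top-k tokens) and preliminary boxes that
    would be encoded as positional queries.
    """
    scores = cls_head(memory).max(dim=-1).values   # (B, N) best class score per token
    topk = scores.topk(k, dim=1).indices           # (B, k) indices of best tokens
    idx = topk.unsqueeze(-1).expand(-1, -1, memory.size(-1))
    content = memory.gather(1, idx)                # (B, k, C) content queries
    boxes = box_head(content).sigmoid()            # (B, k, 4) preliminary boxes
    return content, boxes

memory = torch.randn(2, 1200, 256)
content, boxes = select_initial_queries(memory, nn.Linear(256, 3), nn.Linear(256, 4))
```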
In each decoder layer, object queries first perform self-attention to capture interactions among candidate objects and to model their spatial relationships. The refined queries then attend to the neck memory through cross-attention to retrieve features relevant to each object. Following RT-DETR, deformable attention is adopted to accelerate both training and inference by restricting attention to a sparse set of sampling locations. Through cross-attention, each query progressively aggregates informative features from the memory and becomes more discriminative. Each decoder layer is equipped with an auxiliary detection head to produce intermediate predictions, which provides additional supervision and stabilizes training.
The predictions are matched against the ground truth (GT) using bipartite graph matching to compute the loss. The loss calculation is summarized as follows:

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\big(y_i, \hat{y}_{\sigma(i)}\big),$$

where $\hat{\sigma}$ is the optimal bipartite matching, $\hat{y}_i$ and $y_i$ denote predictions and ground truth, and $\mathcal{L}_{\text{match}}$ combines classification, box regression, and IoU losses.
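The optimal matching itself is typically solved with the Hungarian algorithm. A toy sketch using SciPy's `linear_sum_assignment`, with a hypothetical cost matrix standing in for the full classification + box + IoU matching cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost: np.ndarray):
    """Optimal bipartite matching between predictions and GT objects.

    cost[i, j] is an illustrative matching cost between prediction i and
    ground-truth object j; lower is better.
    """
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))

# Toy 3-prediction / 2-GT example.
cost = np.array([[0.1, 0.9],
                 [0.8, 0.2],
                 [0.5, 0.6]])
pairs = hungarian_match(cost)
print(pairs)  # [(0, 0), (1, 1)] -- the third prediction stays unmatched ("no object")
```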
Moreover, we develop a transfer-learning training scheme tailored to T2C-DETR. First, the full network is pre-trained on a large-scale visible-light or infrared dataset to learn general-purpose representations. Next, we sample images from both the pre-training dataset and our proprietary sonar dataset and apply random noise to construct a small denoising set. During this stage, all modules except the NFM are frozen, and the NFM is trained separately to enhance noise suppression. Finally, the backbone and the NFM are frozen, and the remaining modules are fine-tuned on our small sonar dataset to adapt the detector to acoustic imagery. After training, the number of decoders can be adjusted to trade accuracy for speed and model size without requiring retraining, which is convenient for deployment under different resource constraints.
3.2. Transformer + Convolution Dual-Channel Backbone Network
Transformer blocks are effective at capturing global context via self-attention, which enables long-range dependency modeling and improves robustness when target appearance is degraded, cluttered, or partially ambiguous. In contrast, convolutional blocks extract features with local receptive fields and are well-suited for representing fine-grained structures such as edges, contours, and local textures. From a computational perspective, global attention usually scales quadratically with the number of tokens, whereas convolution scales approximately linearly with image resolution. For sonar imagery, both characteristics are essential: global context supports target localization under strong noise and low contrast and reduces background confusion, while local cues help delineate blurred boundaries and suppress false alarms induced by speckle-like interference. Therefore, integrating Transformer-based global modeling with convolutional local representation is a natural and effective choice for improving detection stability across diverse sonar conditions.
Based on the aforementioned ideas, this paper proposes a novel Transformer + Convolution dual-channel backbone network, termed TCDCNet, as illustrated in Figure 3. We employ concatenation-based fusion at multiple stages (rather than element-wise addition or attention-based fusion) to preserve the full diversity of features from both pathways without information loss. Addition risks erasing discriminative signals when features have different scales, while attention-based fusion introduces additional parameters and computational overhead that could compromise real-time requirements. The concatenation strategy maintains complementary global-context and local-detail representations while remaining computationally efficient for embedded deployment. The overall architecture of TCDCNet is as follows:
3.3. NFM Module
When extracting features from noisy sonar images, noise-related patterns are inevitably propagated into intermediate feature maps together with target cues. If such responses are not explicitly suppressed, they may be amplified by subsequent fusion operations and eventually mislead the detection heads, resulting in degraded localization and increased false alarms. To mitigate this issue, we propose a Noise Filtering Module (NFM) and place it at the front of the neck to denoise the multi-scale features produced by the backbone. This location is chosen because it suppresses noise before encoder/decoder feature interaction and multi-scale fusion, thereby reducing the propagation and amplification of noise responses in downstream modules while preserving the original backbone design. The overall structure of the NFM is shown in Figure 4.
The module comprises two parallel branches: (1) an upper branch with a standard convolution and activation to preserve complementary local responses; and (2) a lower branch that combines depthwise separable convolution, a nonlinear activation, and a squeeze-and-excitation (SE) unit to capture lightweight channel-wise dependencies.
The input feature maps are processed by the two branches in parallel. Their outputs are concatenated and then fed into a convolutional block attention module (CBAM) to adaptively reweight both channel and spatial responses, followed by a convolution for channel projection. With this design, the NFM selectively attenuates noise-dominated activations while retaining discriminative target features, thereby reducing interference to the downstream detector and improving overall detection accuracy.
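A hedged PyTorch sketch of this two-branch structure is given below. The kernel sizes, reduction ratios, and the reduced CBAM (an SE-style channel gate plus a spatial gate) are illustrative assumptions; the paper's exact configuration may differ.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-excitation: lightweight channel reweighting."""
    def __init__(self, c: int, r: int = 4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.SiLU(),
                                nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))     # squeeze over H, W
        return x * w[:, :, None, None]      # excite per channel

class SimplifiedCBAM(nn.Module):
    """Channel + spatial attention (reduced CBAM for brevity)."""
    def __init__(self, c: int, r: int = 4):
        super().__init__()
        self.channel = SE(c, r)             # stand-in for CBAM's channel gate
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        x = self.channel(x)
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class NFM(nn.Module):
    """Noise Filtering Module sketch: two parallel branches -> concat -> CBAM -> 1x1 proj."""
    def __init__(self, c: int):
        super().__init__()
        self.upper = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.SiLU())
        self.lower = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, groups=c),  # depthwise
            nn.Conv2d(c, c, 1),                       # pointwise
            nn.SiLU(),
            SE(c),
        )
        self.cbam = SimplifiedCBAM(2 * c)
        self.proj = nn.Conv2d(2 * c, c, 1)

    def forward(self, x):
        y = torch.cat([self.upper(x), self.lower(x)], dim=1)
        return self.proj(self.cbam(y))

x = torch.randn(2, 64, 20, 20)
out = NFM(64)(x)
```

Because the output has the same shape as the input, the module can be dropped in front of each of the three neck inputs without changing the rest of the pipeline.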
3.4. Specialized Training Strategy
Considering the scarcity of annotated underwater sonar datasets for detector training, we design a transfer-learning-based training strategy that is tailored to the architecture and modules of the proposed detector. The overall procedure is shown in Figure 5, which summarizes the three-stage optimization workflow (pre-training, NFM-only denoising adaptation, and sonar-domain fine-tuning) and clarifies which modules are frozen or trainable at each stage. First, the entire network is pre-trained on a large-scale general-purpose dataset so that the backbone, encoder, and decoder can learn robust feature extraction and representation refinement capabilities. Next, a few hundred images are sampled from the pre-training dataset, and random noise is injected to construct a lightweight denoising set. During this stage, all parameters except those of the three NFM modules are frozen, and the NFMs are fine-tuned to explicitly learn noise suppression in intermediate feature maps. We use the same detection objective as the main training stage (classification + box regression + IoU losses), so gradients are propagated only through NFM parameters while all other modules remain frozen; no additional standalone denoising loss is introduced. Finally, the backbone, all NFMs, the encoder, and the decoder are frozen, and only the IoU-aware query selection module, the auxiliary box predictor for query initialization, and the decoder-specific detection heads are trained on our self-built small-scale sonar dataset. This stage-wise optimization aligns the task-specific mapping with the acoustic imaging characteristics, yielding a detector that is better adapted to the target sonar domain.
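The stage-wise freezing schedule can be sketched by toggling `requires_grad` per parameter group. The submodule names below (`backbone`, `nfms`, `query_select`, `aux_box_head`, `decoder.heads`) are hypothetical stand-ins for the actual module names, and `TinyDetector` is only a minimal test double.

```python
import torch.nn as nn

def set_stage(model: nn.Module, stage: str) -> None:
    """Toggle trainable parameters for the three-stage schedule (illustrative)."""
    groups = {
        # Stage 1: pre-train everything on a large general-purpose dataset.
        "pretrain": lambda n: True,
        # Stage 2: denoising adaptation -- only the NFMs learn.
        "nfm_only": lambda n: n.startswith("nfms"),
        # Stage 3: sonar fine-tuning -- only task-specific heads learn.
        "finetune": lambda n: n.startswith(("query_select", "aux_box_head", "decoder.heads")),
    }
    trainable = groups[stage]
    for name, p in model.named_parameters():
        p.requires_grad = trainable(name)

class TinyDetector(nn.Module):
    """Minimal stand-in exposing the assumed submodule names."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.nfms = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])
        self.query_select = nn.Linear(8, 8)

model = TinyDetector()
set_stage(model, "nfm_only")   # stage 2: only the three NFMs remain trainable
```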
It is worth noting that optical images typically contain richer textures and more complete geometric details than sonar imagery, which may help pre-training learn strong generic representations that transfer to sonar detection. In contrast, infrared images are often closer to sonar in terms of low contrast, limited texture, and coarse structural patterns. Therefore, we conduct separate pre-training using both optical and infrared datasets and report the corresponding results in Table 1, Table 2 and Table 3. For the NFM denoising stage, we apply diverse stochastic degradations to the sampled images, such as random patch corruption, detail blurring, and region-wise tone perturbation, to emulate typical sonar artifacts and improve the NFM's generalization to unseen noise patterns.
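A minimal NumPy sketch of the three degradation types follows. The magnitudes (patch size, blur kernel, gain range) are illustrative assumptions, since the paper does not specify exact parameters.

```python
import numpy as np

def sonar_noise_augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply the three stochastic degradations described above (illustrative).

    img: grayscale image as a float array in [0, 1], shape (H, W).
    """
    out = img.copy()
    h, w = out.shape

    # 1. Random patch corruption: overwrite a small region with speckle noise.
    ph, pw = h // 8, w // 8
    y, x = rng.integers(0, h - ph), rng.integers(0, w - pw)
    out[y:y + ph, x:x + pw] = rng.random((ph, pw))

    # 2. Detail blurring: simple 3x3 box blur to smear fine textures.
    padded = np.pad(out, 1, mode="edge")
    out = sum(padded[dy:dy + h, dx:dx + w]
              for dy in range(3) for dx in range(3)) / 9.0

    # 3. Region-wise tone perturbation: rescale brightness of one half-plane.
    gain = rng.uniform(0.7, 1.3)
    out[:, : w // 2] *= gain
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
noisy = sonar_noise_augment(np.full((64, 64), 0.5), rng)
```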
4. Experiment
4.1. Configuration
To identify which pre-training data source transfers best to underwater sonar imagery, we conducted three parallel experiments in which the original T2C-DETR was pre-trained on the COCO 2017 dataset, the DOTA small-object detection dataset, and an infrared dataset, respectively. For the standalone training of the NFM, we sampled a subset of images from the corresponding pre-training dataset, injected random noise to build a lightweight denoising set, and used it to fine-tune the NFM. Representative examples of the noise-augmented data are shown in Figure 7. Unless otherwise specified, we used the standard COCO AP protocol for evaluation, and all input images were resized to the same fixed resolution.
The infrared pre-training data are from the public FLIR thermal dataset [52]. We use only the thermal channel for pre-training and retain object categories shared with our detection setting (e.g., vehicles and persons) as generic foreground targets. This dataset provides lower-contrast and texture-sparse imagery compared with visible-light datasets, which is beneficial for transferring representations to sonar scenes with weak texture and blurred boundaries.
In the comparative experiments, we evaluated the proposed T2C-DETR against an improved YOLOv5 [35], MLFFNet [27], and the baseline Transformer detector [39]. For YOLOv5 and MLFFNet, after obtaining the pre-trained models, we froze all layers except the detection heads and fine-tuned only the heads using our self-built small-scale sonar dataset. For the baseline, after pre-training, we fine-tuned the IoU-aware query selection module, the auxiliary box predictor used for query initialization, and the decoder-specific detection heads on the same sonar dataset.
Although newer YOLO versions are available, we selected YOLOv5 as the main one-stage comparator for two reasons. First, improved YOLOv5 variants have been explicitly validated for sonar imagery [35], making it a representative and reproducible baseline in this domain. Second, YOLOv5 remains widely adopted in embedded and real-time deployment settings, and therefore provides a practical reference for evaluating the accuracy–speed trade-off of T2C-DETR under comparable engineering constraints.
For the baseline and T2C-DETR, the pre-training stage lasted 72 epochs, and T2C-DETR was further trained for 36 epochs in the NFM denoising stage. The improved YOLOv5 adopted the L model and, together with MLFFNet, was pre-trained for 300 epochs. Fine-tuning on the small-scale sonar dataset was performed for 100 epochs for both methods. All experiments were conducted on two RTX 3080 Ti GPUs under Ubuntu 20.04.
4.2. Execution Details
During training, both T2C-DETR and the baseline employed the IoU-aware module to select the top 300 tokens for initializing object queries in the decoder. Unless otherwise stated, the training schedule, hyperparameters, and denoising-related settings followed the baseline configuration. We optimized all detectors using AdamW with base_learning_rate = 0.0001, weight_decay = 0.0001, global_gradient_clip_norm = 0.0001, and linear_warmup_steps = 2000. Exponential moving average (EMA) was also applied with EMA_decay_rate = 0.999.
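The optimizer and EMA settings above can be reproduced as follows. The single-layer stand-in model and the plain-Python EMA class are illustrative, not the actual training code; only the hyperparameter values come from the text.

```python
import torch
import torch.nn as nn

# Stand-in model; optimizer hyperparameters follow the values reported above.
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

class EMA:
    """Exponential moving average of parameters (decay = 0.999, as above)."""
    def __init__(self, model: nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {n: p.detach().clone() for n, p in model.named_parameters()}

    def update(self, model: nn.Module) -> None:
        for n, p in model.named_parameters():
            # shadow <- decay * shadow + (1 - decay) * param
            self.shadow[n].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)

ema = EMA(model)
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
# Global gradient clipping with the norm reported above.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.0001)
optimizer.step()
ema.update(model)
```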
For YOLOv5 and MLFFNet, we followed the training protocols reported in their original papers. In addition to the dedicated NFM denoising stage, we adopted standard data augmentation operations, including random color distortion, image expansion, random cropping, horizontal flipping, and multi-scale resizing.
Given the dominance of small objects in underwater sonar datasets, we report mAP at an IoU threshold of 0.5 (mAP@0.5) as the primary accuracy metric for all detectors. We additionally report inference speed in frames per second (FPS) to characterize real-time performance. The mAP metric jointly reflects false positives and false negatives through precision ($P$) and recall ($R$), which are computed as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},$$

where TP is the number of correctly predicted positives, FP is the number of false positives, and FN is the number of false negatives. For each category, $P$ and $R$ are computed at IoU = 0.5 to obtain a set of precision–recall (P–R) curves. The area under the P–R curve corresponds to the Average Precision (AP):

$$AP = \sum_{k} P(k)\,\Delta R(k),$$

where $P(k)$ and $R(k)$ denote the precision and recall values at the $k$-th operating point. The final mAP is obtained by averaging the AP values over all categories:

$$mAP = \frac{1}{C} \sum_{i=1}^{C} AP_i,$$

where $C$ denotes the number of categories and $AP_i$ is the AP of the $i$-th category.
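These definitions can be checked with a small pure-Python example using toy precision–recall operating points (the points below are invented for illustration):

```python
def average_precision(points):
    """Riemann-sum AP: area under a P-R curve given as (recall, precision)
    operating points sorted by increasing recall."""
    ap, prev_r = 0.0, 0.0
    for r, p in points:
        ap += p * (r - prev_r)   # P(k) * delta-R(k)
        prev_r = r
    return ap

def mean_ap(per_class_points):
    """mAP = average of per-category AP values."""
    aps = [average_precision(pts) for pts in per_class_points]
    return sum(aps) / len(aps)

# Toy two-category example.
cat1 = [(0.5, 1.0), (1.0, 0.5)]   # AP = 1.0*0.5 + 0.5*0.5 = 0.75
cat2 = [(1.0, 1.0)]               # AP = 1.0
print(mean_ap([cat1, cat2]))      # 0.875
```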
The custom underwater sonar dataset used in this paper comprises 5000 images with three object categories: mines, sunken ships, and crashed airplanes. The counts 3220 (mine), 2860 (crashed airplane), and 2230 (sunken ship) refer to annotated object instances rather than image counts; multiple objects may appear in one image. Small objects dominate this dataset, accounting for 92.5% of all instances. Figure 6 shows representative clean sonar examples, while Figure 7 presents representative noise-augmented samples used in the NFM adaptation stage. The pre-training datasets include COCO 2017, DOTA, and an infrared image dataset.
4.3. Result Analysis
Table 1, Table 2 and Table 3 summarize the quantitative comparisons between the proposed T2C-DETR and competing detectors, where all models are pre-trained on different source datasets and then fine-tuned on our custom sonar dataset. In the three parallel settings, T2C-DETR achieves AP values of 97.8%, 98.2%, and 98.5%, with corresponding inference speeds of 72, 73, and 72 FPS when pre-trained on COCO 2017, DOTA, and the infrared dataset, respectively. Overall, T2C-DETR consistently yields a favorable accuracy–speed trade-off and outperforms detectors of comparable scale. Specifically, relative to the baseline, T2C-DETR improves AP by 0.7%, 1.2%, and 0.9% with similar real-time throughput. Compared with YOLOv5-Imp, T2C-DETR achieves 1.3%, 1.3%, and 1.2% higher AP while being faster at inference. Compared with MLFFNet, T2C-DETR yields AP gains of 1.4%, 1.0%, and 1.1%, together with higher FPS.
Notably, pre-training on the infrared dataset produces the best overall performance. A plausible explanation is that this infrared dataset contains a higher proportion of small objects than COCO 2017 and exhibits a visual style closer to sonar imagery than DOTA (e.g., lower contrast and less texture). In addition, infrared images often include blurred or low-detail targets, which resemble the acoustic imaging characteristics of sonar. These factors make infrared pre-training particularly effective for transferring to sonar object detection.
4.4. Ablation Experiment
To validate the effectiveness of the proposed Transformer + Convolution dual-channel backbone network (TCDCNet) and the Noise Filtering Module (NFM), we conducted ablation experiments to assess the impact of these components on the final results. We compared the performance of baseline algorithms with different improvement methods while maintaining consistent training parameters. The comparative results are presented in Table 4, Table 5 and Table 6, where ✓ indicates the module is enabled.
Based on the results in Table 4, Table 5 and Table 6, both TCDCNet and the NFM significantly enhance the detector's performance. In these experiments, TCDCNet improved the AP by 0.5%, 0.9%, and 0.7%, respectively, compared to the baseline. The NFM also contributed performance gains of 0.2%, 0.6%, and 0.4% AP over the baseline. When both improvements were combined, the gains were larger still.
These findings strongly support the effectiveness of the proposed TCDCNet and NFM in improving the overall performance of the detector.
4.5. Statistical Analysis and Reproducibility
For reproducibility, we specify the infrared pre-training source (the FLIR thermal dataset), the objective used in the NFM-only stage (the same detection loss, with only NFM parameters updated), and the distinction between image-level and instance-level statistics for the custom sonar dataset.
To quantify stability across source domains, we summarize the three parallel pre-training settings (COCO, DOTA, and infrared) in Table 7. Averaged over the three settings, T2C-DETR achieves an AP@0.5 of 98.2% with a standard deviation of about 0.3%, compared with mean APs of 97.2% for the baseline, 96.9% for YOLOv5-Imp, and 97.0% for MLFFNet. In terms of runtime, T2C-DETR averages 72.3 FPS, comparable to the baseline and higher than YOLOv5-Imp and MLFFNet. These statistics indicate that T2C-DETR provides both higher average accuracy and more stable cross-source performance than competing methods.
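The cross-source statistics for T2C-DETR can be recomputed directly from the per-domain results reported in Section 4.3 (AP of 97.8%, 98.2%, and 98.5% at 72, 73, and 72 FPS); the population standard deviation over the three settings is used here:

```python
import statistics

# AP@0.5 (%) and FPS of T2C-DETR under COCO, DOTA, and infrared pre-training.
ap = [97.8, 98.2, 98.5]
fps = [72, 73, 72]

mean_ap = statistics.fmean(ap)
std_ap = statistics.pstdev(ap)   # population std over the three settings
mean_fps = statistics.fmean(fps)
print(round(mean_ap, 2), round(std_ap, 2), round(mean_fps, 1))
```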
From the ablation results (Table 4, Table 5 and Table 6), the TCDCNet-only variant improves AP@0.5 over the baseline by 0.5–0.9% (0.7% on average), the NFM-only variant by 0.2–0.6% (0.4% on average), and the full model by a larger margin than either module alone. This decomposition shows that both modules contribute consistently, with TCDCNet accounting for the larger share of gains and NFM providing additional improvements under noisy sonar conditions.
For deployment-oriented comparison, we additionally report the accuracy–speed product (AP@0.5×FPS) as a practical composite indicator. Averaged over the three source domains, T2C-DETR achieves 7099, compared with 6904 (baseline), 6298 (YOLOv5-Imp), and 6111 (MLFFNet), further supporting its favorable real-time trade-off. T2C-DETR incurs only marginal overhead, increasing from 42M to 45M parameters and from 136 GFLOPs to 142 GFLOPs, while maintaining real-time speeds comparable to the baseline model.
5. Conclusions
This paper presents T2C-DETR, a sonar-image object detector designed to address key challenges in practical underwater perception, including small-object detection, strong noise interference, and data scarcity. Built upon the RT-DETR framework, T2C-DETR preserves the efficient end-to-end pipeline to maintain a lightweight architecture and real-time inference capability. We further propose a Transformer–Convolution dual-channel backbone to jointly capture long-range context and fine-grained local structures, where multi-stage cross-fusion enables complementary global–local representations for improved detection accuracy. In addition, we integrate an NFM into the neck to suppress noise-dominated activations in multi-scale feature maps, thereby enhancing the utilization of target-relevant information.
To alleviate the small-sample limitation of sonar datasets, we develop a stage-wise transfer learning strategy. Specifically, we first pre-train the full network on large-scale visible and infrared datasets to learn general-purpose representations. We then construct a compact denoising set by injecting random degradations into samples from the pre-training and sonar datasets and fine-tune only the NFM modules with the remaining components frozen. Finally, we freeze the backbone and NFM modules and fine-tune the task-specific modules on our custom small-scale sonar dataset to obtain the final detector.
Extensive experiments demonstrate that T2C-DETR effectively mitigates common difficulties in sonar image analysis, including small-target detection under noise and learning with limited annotations. The proposed design exhibits strong robustness and adaptability across multiple sonar datasets, indicating its potential for practical underwater sonar object detection applications.
Quantitatively, under three pre-training settings (COCO 2017, DOTA, and infrared), T2C-DETR reaches AP values of 97.8%, 98.2%, and 98.5% at real-time speeds of 72–73 FPS. Relative to the RT-DETR baseline, AP is improved by 0.7–1.2%; relative to YOLOv5-Imp, AP gains are 1.2–1.3% with higher FPS. These results directly support the conclusion that the proposed dual-channel backbone, NFM, and transfer-learning strategy jointly improve both detection accuracy and deployment efficiency.