Article

FS2-DETR: Transformer-Based Few-Shot Sonar Object Detection with Enhanced Feature Perception

1 College of Artificial Intelligence, Nankai University, Tianjin 300350, China
2 Haihe Lab of ITAI, Tianjin 300459, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(3), 304; https://doi.org/10.3390/jmse14030304
Submission received: 19 December 2025 / Revised: 9 January 2026 / Accepted: 14 January 2026 / Published: 4 February 2026
(This article belongs to the Section Ocean Engineering)

Abstract

In practical underwater object detection tasks, imbalanced sample distribution and the scarcity of samples for certain classes often lead to insufficient model training and limited generalization capability. To address these challenges, this paper proposes FS2-DETR (Few-Shot Detection Transformer for Sonar Images), a transformer-based few-shot object detection network tailored for sonar imagery. Considering that sonar images generally contain weak, small, and blurred object features, and that data scarcity in some classes can hinder effective feature learning, the proposed FS2-DETR introduces the following improvements over the baseline DETR model. (1) Feature Enhancement Compensation Mechanism: A decoder-prediction-guided feature resampling module (DPGFRM) is designed to process the multi-scale features and subsequently enhance the memory representations, thereby strengthening the exploitation of key features and improving detection performance for weak and small objects. (2) Visual Prompt Enhancement Mechanism: Discriminative visual prompts are generated to jointly enhance object queries and memory, thereby highlighting distinctive image features and enabling more effective feature capture for few-shot objects. (3) Multi-Stage Training Strategy: A progressive training strategy is adopted to strengthen the learning of class-specific layers, effectively mitigating misclassification in few-shot scenarios and enhancing overall detection accuracy. Extensive experiments conducted on the improved UATD sonar image dataset demonstrate that the proposed FS2-DETR achieves superior detection accuracy and robustness under few-shot conditions, outperforming existing state-of-the-art detection algorithms.

1. Introduction

Sonar image object detection, as a key technology in underwater perception and detection missions, has been widely applied in fields such as the recovery of sunken ships and aircraft wreckage, submarine monitoring, seabed resource exploration, and underwater security [1,2,3]. Compared with optical imaging, sonar imaging can reliably capture object information in complex underwater environments where low light levels and high turbidity prevail, providing it with unique advantages in both military and civilian applications [4,5]. However, constrained by underwater acquisition conditions and mission complexity, sonar image datasets often suffer from limited sample sizes and severe class imbalance, with some critical object classes being particularly scarce. These limitations hinder deep learning models from fully learning discriminative object features, thereby restricting both generalization capability and detection accuracy [6,7]. Consequently, achieving efficient and robust object detection in sonar images under few-shot conditions has become a critical challenge in the field of underwater intelligent perception.
Although deep learning-based detectors achieve remarkable performance when sufficient annotated data are available, many real-world scenarios provide only a few annotated samples for novel classes or critical objects. This challenge has motivated researchers to explore how detectors can acquire few-shot learning capabilities to maintain robust performance under data-scarce conditions [8,9]. In response, numerous few-shot object detection frameworks have been proposed in recent years [10,11]. Meta Faster R-CNN [12] introduces a meta-learning-based approach built upon the Faster R-CNN framework, incorporating a prototype matching network and an attention-based feature alignment mechanism to enhance candidate box quality and detection accuracy for few-shot classes; FSCE [13] integrates contrastive learning into few-shot object detection by establishing a supervised contrastive mechanism at the proposal level, improving feature discriminability and robustness and significantly boosting detection performance for novel classes; DeFRCN [14] addresses the multi-stage and multi-task coupling problems inherent in classical Faster R-CNN under few-shot scenarios by introducing a gradient decoupled layer and a prototypical calibration block, which decouples feature propagation and classification calibration and substantially improves detection performance; and B-FSDet [15], built upon the YOLOv9 architecture, combines meta-learning with a balanced few-shot detection strategy, ensuring balanced class representation through data cleaning and introducing a steady-state feature extraction module with a fast prediction mechanism. Meta-DETR [16] incorporates image-level meta-learning within the DETR framework, leveraging class feature encoding and the semantic alignment mechanism (SAM), which unifies object localization and classification.
FS-DETR [17] enables flexible detection of any number of new classes and samples without fine-tuning by using visual templates of new classes as prompts during testing, which are integrated with pseudo-class embeddings to generate predictions. Although these approaches have demonstrated significant progress in optical image detection, underwater sonar imagery is characterized by numerous weak objects with indistinct boundaries, together with severe class scarcity, which makes the direct transfer of existing methods challenging. These challenges underscore the urgent need for few-shot object detection research specifically tailored to sonar imagery.
In complex real-world underwater detection tasks, sonar image object detection holds significant practical value. However, objects in sonar images are typically small, exhibit blurred edges, and are often affected by severe background noise interference. These factors pose substantial challenges to deep learning–based detection models in terms of effective feature extraction and accurate object localization [18,19,20]. In recent years, several studies have achieved remarkable progress by refining classical detection frameworks or developing novel network architectures. For instance, Zhang et al. [21] improved YOLOv5 for forward-looking sonar image detection by leveraging transfer learning and anchor optimization, achieving better accuracy and efficiency for sonar objects. Similarly, Wang et al. [22] proposed MLFFNet, which addresses background interference and scale-related challenges in sonar images through multi-scale feature fusion and attention mechanisms. Zhao et al. [23] developed a multi-scale feature enhancement framework that incorporates composite backbones, attention mechanisms, and feature fusion strategies, enabling robust detection of underwater objects with extreme aspect ratios and arbitrary orientations. Palomeras et al. [24] designed an automatic object recognition system that integrates CNN-based detectors and classifiers with probabilistic grid maps, facilitating efficient detection of landmine-like objects. While these methods have achieved favorable results, their effectiveness typically relies on the availability of large and diverse training datasets. In real underwater environments, acquiring sonar image data is costly and often results in imbalanced class distributions, which limits the generalization ability of detectors. To address this limitation, recent studies have also explored applying few-shot learning to sonar image analysis. For instance, Ghavidel et al. 
[25] investigated sonar object classification in a few-shot setting and introduced a concept-based feature extraction scheme that combines wavelet denoising and short-time Fourier transform (STFT), leading to notable performance gains with scarce annotations. Grijalva et al. [26] further investigated self-supervised learning for underwater perception as a means to alleviate annotated data scarcity. By comparing RotNet, Denoising Autoencoder (DAE), and Jigsaw networks, they validated the effectiveness of self-supervised learning for few-shot sonar image classification. However, few studies have focused on few-shot object detection in sonar imagery, where achieving high-precision detection under extremely limited samples remains a formidable challenge.
To bridge this gap, we propose FS2-DETR, a transformer-based framework tailored for few-shot object detection in underwater sonar imagery. RT-DETR [27] serves as the architectural reference for the proposed model, with its end-to-end compact design facilitating the integration of few-shot learning modules and offering a favorable balance between efficiency and accuracy for underwater detection. In response to the issue of limited training samples and imbalanced class distribution, FS2-DETR introduces three key improvements: First, a feature enhancement compensation mechanism is introduced, in which a dedicated DPGFRM leverages the predictions of the decoder to refine the encoder memory, allowing critical information from limited samples to be more effectively preserved and utilized; Second, an optimized visual prompt mechanism is incorporated, where strengthened visual prompts are used to jointly guide object queries and memory representations, enabling the detector to better focus on distinctive image features; Third, to alleviate the misclassification commonly encountered in few-shot learning, an efficient multi-stage training strategy is adopted to progressively strengthen the learning of class recognition layers, thereby promoting better inter-class separability among few-shot classes. It is crucial to emphasize that the core objective of these improvements is to enhance feature utilization efficiency for few-shot classes, thereby boosting the detector’s learning and generalization capabilities under data scarcity conditions. Concurrently, these enhancements also yield additional benefits for detecting weak objects. Overall, the main contributions of this paper include the following aspects:
  • We introduce FS2-DETR, a DETR-based framework for few-shot object detection in sonar images, which explicitly addresses the dual challenges of data scarcity and degraded object features. By integrating tailored architectural designs with optimized training strategies, the proposed framework achieves more stable and generalizable detection performance under extremely limited supervision.
  • A feature enhancement compensation mechanism is proposed to reinforce key feature learning. By using the dedicated DPGFRM, the predictions of the decoder are employed to augment the encoder memory. This design effectively captures latent semantic cues from limited samples and alleviates insufficient feature representation in few-shot classes.
  • A visual prompt enhancement mechanism is proposed to improve feature interaction and representation learning. By using optimized visual prompts, object queries and encoder memory are jointly strengthened, effectively highlighting salient regions and key features in sonar images. This facilitates more efficient capture of semantic distinctions among few-shot objects, enhancing the detector’s sensitivity and inter-class discrimination.
  • A multi-stage training strategy is proposed to address class confusion in few-shot object detection. By progressively reinforcing class recognition across successive training phases, this approach effectively reduces misclassifications and enhances the model’s robustness under few-shot conditions.
The remainder of this paper is organized as follows. Section 2 reviews the related work, Section 3 details the proposed method, Section 4 reports experiments validating its effectiveness, and Section 5 concludes the paper.

2. Related Work

2.1. DETR

Carion et al. first introduced DETR [28] in 2020, which adopts a transformer-based encoder–decoder architecture and employs a fixed number of object queries for multi-object prediction. By removing manually designed components such as anchor generation and non-maximum suppression used in traditional detectors, DETR established a truly end-to-end object detection framework. As illustrated in Figure 1, DETR attracted widespread attention for its structural simplicity and conceptual innovation. However, the original DETR relied on a single-layer feature map from the backbone for object prediction. In addition, the Transformer module exhibited a strong dependence on the scale of the training data, resulting in slow convergence and suboptimal performance, particularly for small object detection. To address these limitations, subsequent studies have proposed various improved variants to enhance DETR’s training efficiency and detection performance [29,30]. Deformable DETR [31] employs a deformable attention mechanism that concentrates on sparse key points, significantly reducing computational complexity and enhancing both convergence speed and detection accuracy. Conditional DETR [32] and Anchor-DETR [33] tackled the initialization of object queries, establishing stronger spatial consistency with object locations to effectively reduce optimization complexity. Furthermore, DAB-DETR [34] and DN-DETR [35] enhanced training stability and model robustness by refining positional encoding and incorporating denoising training strategies. Building upon these advancements, DINO [36] integrates their strengths, achieving superior performance through optimized denoising mechanisms and improved gradient propagation strategies. In recent years, models such as Efficient DETR [37], Sparse DETR [38], and RT-DETR have made steady progress in lightweight design and accelerated inference.
Through a careful trade-off between accuracy and inference efficiency, these models have promoted the widespread application of transformer architectures in practical detection tasks. Notably, RT-DETR innovatively replaces traditional cross-scale interactions among multi-scale features with intra-scale interactions and cross-scale fusion, thereby improving feature integration efficiency. It achieves a new equilibrium between detection accuracy and inference speed, making it a preferred solution for real-world deployment scenarios.
Overall, the DETR series of detectors have demonstrated outstanding performance in optical image detection. However, their transformer modules generally rely on large-scale datasets to achieve satisfactory performance and generalization, which significantly limits their applicability in domains with limited data, such as medical imaging and sonar imagery. Notably, the end-to-end architecture of DETR exhibits strong modularity and scalability, enabling task-specific customization and facilitating model transfer. Building upon this foundation, this paper adopts RT-DETR as the base model and introduces modular structural improvements and policy-based optimization strategies to achieve accurate detection of weak objects in sonar images under few-shot conditions.

2.2. Few-Shot Object Detection

Few-shot learning (FSL) seeks to achieve effective feature representation and object recognition using only a limited number of labeled examples. Its core objective is to preserve model generalization and ensure robust inference performance under conditions of data scarcity [39,40]. In object detection, few-shot object detection further extends this challenge by requiring models to simultaneously classify and localize objects using limited training data, positioning it as a key research focus in modern computer vision [41,42]. Classical few-shot object detection techniques can be broadly classified into meta-learning-based [12,14] and fine-tuning-based approaches [13,43], reflecting different strategies for learning from limited samples. The central idea of meta-learning is “learning to learn,” where models are trained across multiple tasks to rapidly adapt to new classes and detection scenarios, thereby achieving strong detection performance even under extremely limited data conditions. Meta Faster R-CNN [12], for instance, integrates prototype matching and attention alignment mechanisms within the Faster R-CNN framework, effectively improving proposal quality and classification accuracy for few-shot classes. Furthermore, methods such as wDAE-GNN [44] and TIP [45] extend these ideas by incorporating graph neural network (GNN) modeling and transformation-invariant constraints, thereby enhancing class transfer and feature generalization capabilities and providing more adaptive solutions for few-shot detection. Fine-tuning-based methods draw inspiration from the principles of transfer learning. By pre-training detection models on large-scale datasets and subsequently fine-tuning them with a limited number of object samples, these approaches enable effective knowledge transfer and rapid adaptation. Early work, such as LSTD [46], incorporated multi-detection structures and regularization strategies to mitigate feature discrepancies between source and target domains. 
Later methods, including TFA [43], DeFRCN [14], and Meta-DETR [16], further enhanced detection stability on few-shot classes and improved cross-class transferability through mechanisms such as feature decoupling, classification head re-training, and feature reconstruction. Moreover, studies such as FSCE [13], MPSR [47], and Retentive R-CNN [48] introduced contrastive learning, sample reweighting, and class bias correction mechanisms, which effectively alleviated confusion and misclassification in few-shot detection by improving both feature discriminability and class consistency.
In recent years, the non-retraining paradigm has emerged as a promising research direction in few-shot object detection. Its primary objective is to enable the detection of novel classes solely through the generic knowledge acquired during base-class training, without the need for additional fine-tuning or retraining when new classes arise. For instance, FS-DETR [17] allows direct input of visual templates for new classes as prompts after base-class training, facilitating rapid detection through visual guidance. Similarly, AirDet [49] learns class prototypes and cross-scale support guidance networks within a parallel framework, achieving superior inter-class generalization by performing feature association and bounding box regression through its detection heads. Compared with fine-tuning- and meta-learning-based approaches, this class of methods offers greater flexibility and scalability during inference.
Among existing approaches, meta-learning methods typically exhibit strong generalization ability and rapid adaptation to novel classes; however, their training procedures are often complex and may suffer from instability. Fine-tuning-based methods usually achieve higher detection accuracy but are prone to overfitting under few-shot settings. In contrast, non-retraining paradigms offer greater flexibility and plug-and-play characteristics, but their heavy reliance on visual prompts not only increases storage and computational demands but may also compromise detection effectiveness due to suboptimal prompt quality. Taking these considerations into account, we choose a fine-tuning-based strategy to construct the few-shot detection architecture. Visual prompts are incorporated during the training phase to enhance feature utilization, while this mechanism is removed during inference to balance training performance and deployment efficiency.

3. Method

Figure 2 shows the architecture of FS2-DETR, which extends RT-DETR with three improvements aimed at enhancing its few-shot object detection capabilities. First, a feature enhancement mechanism is proposed to improve the reuse of critical contextual information. It leverages a decoder-prediction-guided feature resampling module to resample and refine multi-scale feature maps within bounding boxes, and integrates these enhanced features into the memory. In parallel, a dual-enhanced visual prompt mechanism is introduced, where prompts extracted from template images interact with the enhanced memory via cross-attention to strengthen class-specific semantic representations. These prompts also act as additional object queries in the decoder’s self-attention, further improving class awareness and information exchange. Finally, inspired by the observation that projection layers and classification heads are more class-sensitive than other modules [50], a multi-stage training strategy is designed that progressively emphasizes these sensitive layers, improving their adaptability to novel classes while reducing catastrophic forgetting and overfitting. Apart from these modifications, the rest of the detector remains unchanged, following the same implementation settings as RT-DETR.

3.1. Preliminary

Given two distinct sets of classes, $C_{base}$ and $C_{novel}$, where $C_{base} \cap C_{novel} = \emptyset$, few-shot object detection aims to build a detector that is first fully trained on the base class set $C_{base}$ with abundant annotated samples, and then fine-tuned on the novel class set $C_{novel}$ containing only a few annotated instances (optionally including a small portion of base-class samples to stabilize learning), so that it can accurately detect and localize all objects in both $C_{base}$ and $C_{novel}$. In a $k$-shot detection scenario, only $k$ annotated samples are provided for each novel class in $C_{novel}$ during fine-tuning.
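As a concrete illustration, the $k$-shot fine-tuning protocol above can be sketched in a few lines of Python. The class names, the `annotations` structure, and the `base_ratio` parameter are illustrative assumptions, not details taken from the paper:

```python
import random

def build_kshot_finetune_set(annotations, base_classes, novel_classes,
                             k, base_ratio=0.1, seed=0):
    """Sample exactly k annotated instances per novel class, plus a small
    portion of base-class samples to stabilise fine-tuning.
    `annotations` maps class name -> list of annotated instances."""
    rng = random.Random(seed)
    finetune_set = {}
    for c in novel_classes:
        pool = annotations[c]
        # k-shot constraint: only k instances of each novel class are seen
        finetune_set[c] = rng.sample(pool, min(k, len(pool)))
    for c in base_classes:
        pool = annotations[c]
        # optional stabilising subset of abundant base-class data
        n_base = max(1, int(base_ratio * len(pool)))
        finetune_set[c] = rng.sample(pool, min(n_base, len(pool)))
    return finetune_set
```

In a 5-shot setting, for example, each novel class contributes exactly five annotated instances to the fine-tuning stage, regardless of how many exist in the full dataset.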

3.2. Memory Feature Enhancement Compensation Mechanism

Objects in sonar images are typically small, with blurred edges, and their salient regions are easily obscured by background noise. This makes it difficult for detectors to focus on key object areas during feature learning, limiting their ability to represent small objects. In few-shot scenarios, the scarcity of samples further hinders the model’s ability to extract stable and discriminative feature representations. Efficiently mining and fully utilizing key information from limited images is therefore critical for improving detection performance. In DETR, the decoder and its associated detection head generate high-quality prediction boxes corresponding to potential object locations, and the regions covered by these boxes contain rich semantic information. Strengthening the representation of these regions during feature extraction can significantly enhance the model’s ability to learn and utilize key object features. Motivated by this, we propose a feature enhancement compensation strategy based on decoder predictions, aimed at improving the representation of key semantic information in memory features.
Specifically, the decoder in DETR consists of multiple stacked layers, where the object queries in each layer interact with the memory features to extract semantic information related to the objects. To further enhance this information interaction, we design the DPGFRM, which resamples multi-scale feature maps produced by the backbone after each decoder to strengthen the representation of object regions. As illustrated in Figure 3, we first select the top-k predictions with the highest confidence scores from the current decoder layer. The corresponding spatial locations are then projected back onto the multi-scale feature maps, from which the associated feature regions are extracted. These region features are subsequently fed into the proposed DPGFRM for refinement. The internal structure of DPGFRM is depicted in Figure 4, where the resampled features are processed by a lightweight encoder, enabling effective modeling of dependencies among key tokens within object regions. The refined features are then projected into a representation space consistent with the memory. Finally, these enhanced features are concatenated with the original memory features and incorporated into the cross-attention computation of the subsequent decoder layer, thereby facilitating more effective reuse of critical contextual information. The specific implementation steps are as follows:
$$F_i^{\prime} = \mathrm{Self.Attn}\big(\mathrm{Flatten}(F_i)\big), \quad i = 1, 2, \dots, L$$
$$T = \mathrm{Concat}\big(M^{\prime}, F_i^{\prime}\big)$$
where $F_i$ denotes the set of features resampled from the $i$-th feature map, $L$ represents the total number of feature map layers output by the backbone, $F_i^{\prime}$ corresponds to the region features after self-attention processing, and $M^{\prime}$ represents the memory features output by PAN-Fusion and enriched through interaction with the visual prompts. The final enhanced memory features $T$ are obtained by concatenating $F_i^{\prime}$ and $M^{\prime}$, and subsequently interact with the object queries in the next decoder layer. As the decoder iterates across layers, predictions are progressively refined, and the resampled regions become increasingly precise, leading to a gradual strengthening of object region representations within the memory. This mechanism effectively enhances the detector’s feature extraction capability and overall detection performance in few-shot and small-object scenarios, without imposing a substantial computational overhead.
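The resample-then-concatenate step can be sketched as follows. This is a simplified single-scale, single-head NumPy illustration with randomly initialized projection matrices; the actual module operates on multi-scale features inside the detector, and the crop-and-flatten scheme here is an assumption for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(tokens, Wq, Wk, Wv):
    # single-head scaled dot-product self-attention over region tokens
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def dpgfrm(feat_map, boxes, scores, memory, top_k, rng):
    """Sketch of the decoder-prediction-guided resampling step: crop the
    features inside the top-k predicted boxes, flatten them into tokens,
    run self-attention, and concatenate with the memory tokens."""
    d = memory.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    order = np.argsort(scores)[::-1][:top_k]      # top-k most confident predictions
    H, W, _ = feat_map.shape
    region_tokens = []
    for x0, y0, x1, y1 in boxes[order]:
        # project normalised box corners back onto the feature grid
        c0, c1 = int(x0 * W), max(int(x0 * W) + 1, int(x1 * W))
        r0, r1 = int(y0 * H), max(int(y0 * H) + 1, int(y1 * H))
        region_tokens.append(feat_map[r0:r1, c0:c1].reshape(-1, d))
    tokens = np.concatenate(region_tokens, axis=0)   # Flatten(F_i)
    refined = self_attn(tokens, Wq, Wk, Wv)          # F_i'
    return np.concatenate([memory, refined], axis=0) # T = Concat(M', F_i')
```

The enlarged memory `T` then replaces the plain memory in the next decoder layer's cross-attention, so the extra tokens cost only a modest increase in attention length.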

3.3. Visual Prompt Enhancement Mechanism

In few-shot sonar object detection tasks, insufficient sample numbers and minor differences across classes often hinder the detector from effectively learning class-specific discriminative features during training. This challenge is further exacerbated under complex background noise and blurred object boundaries, where the model tends to suffer from feature confusion and unstable class prediction. As a result, the encoder-extracted representations are dominated by general semantics and lack explicit category awareness. The visual prompt mechanism, as an efficient feature-guided strategy, addresses this issue by introducing explicit feature cues (prompts) derived from template images, without requiring additional class samples. These visual prompts provide class-related guidance signals that significantly enhance the model’s discriminative capability under limited data conditions [16,17]. By injecting salient semantic information from scarce samples into the feature learning process, this mechanism helps the detector concentrate on key features relevant to each class, thereby improving overall detection performance.
Considering the advantages of the visual prompt mechanism and the feature learning bottlenecks in few-shot sonar object detection, this study proposes a dual-level enhancement mechanism based on visual prompts to strengthen the model’s perception and utilization of class-specific features. As illustrated in Figure 2, we assume a set of template images denoted as $V_{i,j} \in \mathbb{R}^{W_T \times H_T \times 3}$, $i = 1, 2, \dots, c$, $j = 1, 2, \dots, k$, where $c$ represents the number of object classes, $k$ denotes the number of template images used per class, and $W_T$ and $H_T$ represent the width and height of the template images, respectively. From the template set, we randomly select a total of $n$ template images corresponding to the object classes present in the current image, and another $n$ template images drawn from the remaining classes that do not appear in the current image, with $3 \le n \le k$. These selected templates are then encoded by the backbone $B$, which shares weights with the detector, and subsequently global-pooled to generate the visual prompt:
$$P = \mathrm{Ave.Pool}\big(B(V)\big), \quad P \in \mathbb{R}^{2n \times d}$$
here, $\mathrm{Ave.Pool}$ refers to the operation of global average pooling, and $P$ contains the shared semantic representations of all object classes. The visual prompt serves two primary functions within the model. First, it interacts with the memory features through a cross-attention mechanism to facilitate semantic information exchange:
$$M^{\prime} = \mathrm{Cros.Attn}(M, P) + M$$
where $M$ denotes the original memory feature produced by the PAN-Fusion module. Through feature interaction, class-specific semantic information within the memory is enhanced, while irrelevant class and background semantics are suppressed. This process enables the subsequent decoding stage to focus more precisely on object regions. Second, the visual prompt is regarded as an additional set of object queries, which are fed into the decoder together with the original queries:
$$\hat{O} = \mathrm{Concat}(P, O), \quad \hat{O} \in \mathbb{R}^{(2n + N) \times d}$$
where $O \in \mathbb{R}^{N \times d}$ denotes the original object queries, while $P$ and $O$ share the same initialization of positional encoding. Within each decoder layer, the object queries first perform self-attention-based feature interaction, enabling the original queries to acquire class-specific perception capability. Subsequently, they engage in multi-head cross-attention with the enhanced memory features $T$, during which the object queries are guided to focus on discriminative features of specific classes, thereby improving the efficiency of information flow. The detailed implementation is as follows:
$$\hat{O}^{\prime} = \mathrm{Self.Attn}\big(\mathrm{LN}(\hat{O})\big) + \hat{O}$$
$$\hat{O}^{\prime\prime} = \mathrm{Cros.Attn}\big(\mathrm{LN}(\hat{O}^{\prime}), T\big) + \hat{O}^{\prime}$$
$$O_f = \mathrm{MLP}\big(\mathrm{LN}(\hat{O}^{\prime\prime})\big) + \hat{O}^{\prime\prime}$$
where $O_f$ denotes the final output object queries of the current decoder layer, and $\hat{O}^{\prime}$ and $\hat{O}^{\prime\prime}$ denote the object queries produced by the decoder’s self-attention and cross-attention layers, respectively, both of which adopt the deformable attention mechanism.
Unlike previous studies, the proposed visual prompt mechanism does not rely on pseudo-class encoding–based prediction. By selecting an appropriate number of positive and negative class template images, the detector is encouraged to learn class-relevant features while simultaneously capturing discriminative differences between classes that are present in the image and those that are absent. This design enhances the detector’s ability to model blurred object features and to distinguish visually confusing objects under few-shot conditions. Overall, the proposed visual prompt enhancement mechanism establishes bidirectional guidance from the feature level to the query level, thereby significantly improving the utilization efficiency of class-specific features in limited-data regimes and effectively alleviating class confusion and feature ambiguity commonly encountered in sonar image object detection. It is worth noting that the visual prompts are used solely for feature guidance and do not participate in subsequent detection head mapping or bipartite matching. Moreover, since the model parameters have already absorbed the structural and semantic priors provided by the prompts during training, the use of template information is removed during inference, having no impact on the detector’s runtime.
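A minimal NumPy sketch of the dual-level prompt path described in this section follows. The shapes, the stub single-head cross-attention, and the random projections are illustrative assumptions; the actual model encodes templates with the shared backbone $B$ and uses multi-head attention:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(q_tokens, kv_tokens, Wq, Wk, Wv):
    # single-head cross-attention: queries attend over key/value tokens
    q, k, v = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def visual_prompt_enhance(template_feats, memory, queries, rng):
    """Sketch of the dual-level prompt mechanism: global-average-pool the
    backbone features of the 2n templates into prompts P, enhance the
    memory residually (M' = Cros.Attn(M, P) + M), and prepend the prompts
    to the object queries as extra queries."""
    d = memory.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    P = template_feats.mean(axis=(1, 2))               # Ave.Pool(B(V)), shape (2n, d)
    M_enh = cross_attn(memory, P, Wq, Wk, Wv) + memory # residual memory enhancement
    queries_aug = np.concatenate([P, queries], axis=0) # Concat(P, O), shape (2n+N, d)
    return M_enh, queries_aug
```

Because the prompts only guide feature learning, this whole path can be dropped at inference time, matching the deployment behaviour described above.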

3.4. Multi-Stage Training Strategy

Previous studies [50] have shown that, in object detection tasks, certain projection layers within neural networks are more sensitive to class-specific semantic features and exhibit strong class relevance, whereas other layers or modules primarily focus on spatial structural information and are relatively insensitive to class variations. Therefore, in fine-tuning–based few-shot learning, selectively freezing class-irrelevant modules while updating only class-sensitive connection layers or high-level semantic modules can significantly enhance class discrimination ability while maintaining feature stability. In contrast, although meta-learning methods offer rapid adaptability to new tasks, their generalization performance in real-world applications is often constrained by distribution discrepancies across tasks. Fine-tuning methods, on the other hand, can fully exploit the general representations learned by pre-trained models and achieve more stable optimization through staged and hierarchical training strategies. Based on these considerations, this study proposes a multi-stage training mechanism designed to achieve efficient parameter adaptation and synergistic improvement in detection performance under limited-sample conditions.
Through extensive comparative experiments, we found that keeping the modules highlighted in blue in Figure 2 unfrozen during the fine-tuning stage—namely, the projection layers following the backbone, the query projection layers for object query initialization, the DPGFRM, and the classifier—leads to superior detection performance. The detailed results are presented in the ablation study section. In the training process, a multi-stage training strategy was employed. First, the entire network was trained end-to-end using a large set of fully annotated samples containing only C_base, allowing the model to learn general visual features of sonar images, establish robust feature representations, and achieve efficient recognition of the base classes. Subsequently, during the fine-tuning stage, all parameters except for the blue-highlighted modules were frozen, and only the class-sensitive projection layers were updated. In this stage, the model was specifically optimized using a small number of annotated samples containing both C_base and C_novel. Moreover, in both training stages, the visual prompt enhancement mechanism described in the previous section was applied, enabling the class-sensitive layers to fully capture discriminative features across different classes and strengthen the distinction of class-specific semantic information.
This training scheme effectively prevents excessive updating of general features during few-shot training, thereby maintaining the stability of the pre-trained feature space. At the same time, the staged and hierarchical training approach enables a gradual transition from general feature learning to class-specific feature optimization, promoting the synergistic enhancement of feature representation and task adaptability. Overall, the multi-stage training strategy achieves a well-balanced trade-off between stability and adaptability, significantly improving both detection accuracy and robustness in few-shot sonar image object detection tasks.
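The selective freezing used in the fine-tuning stage can be sketched in PyTorch as below. The keyword names are illustrative placeholders for the class-sensitive modules kept trainable (the projection layers after the backbone, the query projection layers, the DPGFRM, and the classifier); FS2-DETR’s actual module names may differ.

```python
import torch.nn as nn

def freeze_for_finetuning(model,
                          trainable_keywords=("input_proj", "query_proj",
                                              "dpgfrm", "class_head")):
    """Freeze all parameters except those in class-sensitive modules.

    A parameter stays trainable only if its name contains one of the
    (assumed) keywords; everything else is frozen, preserving the
    pre-trained feature space during few-shot fine-tuning.
    """
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
    return model
```

Freezing by parameter name keeps the scheme declarative: switching between the freezing strategies compared in the ablation study amounts to changing the keyword list.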

4. Experiments

4.1. Dataset

We evaluated the performance of the proposed FS2-DETR on the publicly available UATD sonar image dataset [51] and compared it against several state-of-the-art few-shot object detectors. UATD is a forward-looking sonar image dataset containing approximately 9200 precisely annotated images, covering ten representative object classes: cube, ball, cylinder, human body, distressed airplanes, circular cage, square cage, metal bucket, tyre, and blueROV. It provides a suitable benchmark for evaluating object detection model performance. However, the dataset contains a certain proportion of images with abnormal aspect ratios, which can lead to unstable detection results when applying preprocessing operations such as random cropping or resizing. To improve data quality and balance class distributions, we first preprocessed the original dataset by resizing all images to a uniform resolution of (640, 1280) pixels and removing low-quality samples. Representative examples of different classes are shown in Figure 5. After filtering, a total of 8600 high-quality images were retained and partitioned into training, validation, and test subsets following a 7:2:1 ratio. Given the high proportion of small objects in the sonar dataset, this study evaluates the effectiveness of various object detection models using mean average precision at a 0.5 IoU threshold (mAP50) and at a 0.5:0.95 IoU threshold (mAP), providing a comprehensive assessment of detection performance. For the experimental setup, seven classes were designated as base classes and three classes as novel classes. To mitigate the influence of random factors on the results, we adopted three different class-splitting schemes, each containing distinct combinations of base and novel classes, as presented in Table 1.
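The 7:2:1 partitioning of the 8600 retained images can be reproduced with a simple seeded shuffle-and-slice; the function below is a sketch under that assumption (the paper does not specify the exact splitting procedure or seed).

```python
import random

def split_dataset(image_ids, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle image IDs and partition them into train/val/test subsets.

    The 7:2:1 ratio matches the paper; the seeded shuffle is an assumed
    implementation detail for reproducibility.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * ratios[0])
    n_val = round(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```

For 8600 images this yields 6020 training, 1720 validation, and 860 test samples.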

4.2. Implementation Details

Experiments were performed using the standard RT-DETR-l configuration, with all settings kept consistent except for the modifications illustrated in Figure 2. The backbone network adopts HgNet2, which is self-supervisedly pretrained on a custom sonar image dataset using SwAV [52] to enhance feature representation for sonar imagery. In addition, the remaining detector components are further pretrained in an unsupervised manner using AptDet [53], aiming to strengthen the localization capability of the overall detection framework. Training is carried out in two distinct stages. In the first stage, the model is pretrained on the training set consisting solely of base classes, with the objective of learning generalizable feature representations for sonar images. During this stage, training is performed using the AdamW optimizer with a starting learning rate of 2 × 10⁻⁴, dynamically adjusted via a cosine annealing schedule to ensure stable convergence. A weight decay of 1 × 10⁻⁴ is applied, with a batch size of 4, and training proceeds for 50 epochs. No network layers are frozen in this stage to allow comprehensive semantic learning from the large-scale data. In the fine-tuning stage, only the projection layers following the backbone, the query projection layers, the DPGFRM, and the classifier are fine-tuned. This stage aims to adapt class-specific features to novel classes. To prevent overfitting on the limited samples, the learning rate is lowered to 1 × 10⁻⁵ while the batch size is maintained at 2. Experiments are conducted with four different k-shot settings (k = 1, 3, 5, and 10), providing the corresponding number of annotated samples per class for fine-tuning, and training for 24 epochs. Table 2 concisely presents the training configurations for the two training stages.
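The two stage-specific optimizer configurations above can be sketched in PyTorch as follows. The hyperparameters (learning rates, weight decay, epoch counts) come from the paper; the function name and the use of `CosineAnnealingLR` over whole epochs are illustrative assumptions.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_stage_optimizer(model, stage="base"):
    """Configure the optimizer for the two training stages.

    Stage 1 (base training): lr 2e-4 with cosine annealing over 50 epochs.
    Stage 2 (fine-tuning): lr lowered to 1e-5 over 24 epochs; only
    parameters left trainable (requires_grad=True) receive updates.
    """
    lr, epochs = (2e-4, 50) if stage == "base" else (1e-5, 24)
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(params, lr=lr, weight_decay=1e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```

Filtering on `requires_grad` means the same helper works unchanged in both stages once the fine-tuning freeze has been applied.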

4.3. Analysis of Experimental Results

Table 3, Table 4 and Table 5 present the mAP50 comparison between our method and several classical few-shot object detection approaches on the improved UATD dataset under four k-shot settings (k = 1, 3, 5, 10) and three different data-splitting configurations. Each configuration was repeated three times, and the resulting performance variation is reported to ensure robustness. All comparative experiments follow the same experimental settings as in the original papers. Here, the baseline refers to the original RT-DETR. For fair comparison, all comparative methods adopt the same unsupervised pre-training strategy as FS2-DETR. Results across the three tables show that our proposed method achieves the best performance under almost all data-split configurations with consistently stable results, demonstrating its stability and effectiveness in enhancing few-shot detection. In addition, we observe that DETR-based detectors consistently outperform their CNN-based counterparts. This advantage can be attributed to the multi-layer attention mechanism in DETR, which enables better focus on weak and small sonar objects, thereby reducing information loss during inference. Furthermore, our method consistently outperforms both Meta-DETR and Hint-DETR under all three data-split configurations. Specifically, it achieves average gains of 2.0 and 1.0 mAP50 at 1-shot, 2.5 and 1.2 mAP50 at 3-shot, 2.4 and 0.7 mAP50 at 5-shot, and 1.6 and 1.0 mAP50 at 10-shot. These improvements stem from the specialized enhancements designed for the characteristics of sonar images, which make the proposed method better suited for sonar object detection tasks.
Table 6, Table 7 and Table 8 present the mAP comparisons between our method and competing approaches under the same experimental settings. As shown in these tables, the proposed method consistently achieves near-best performance across different configurations. Compared with detectors of similar architectures, such as Meta-DETR and Hint-DETR, our method obtains average mAP improvements of 1.1 and 0.4 in the 1-shot setting, 1.2 and 0.7 in the 3-shot setting, 1.2 and 0.6 in the 5-shot setting, and 1.1 and 1.0 in the 10-shot setting, respectively. Moreover, the proposed method exhibits limited performance variation across multiple runs under different split configurations, further demonstrating its robustness and stability.
Figure 6 shows a comparison of detection results between our method and other classical approaches. As shown, our method achieves more accurate localization and recognition performance. Due to the weak and indistinct nature of sonar objects, other detectors often fail to locate objects reliably, leading to frequent missed detections and false positives. In contrast, our proposed method significantly reduces both missed detections and false alarms, further demonstrating its robustness and effectiveness in sonar image object detection.

4.4. Ablation Studies

4.4.1. Ablation on the Proposed Enhancement Modules

We further conducted ablation studies to evaluate the contribution of each improvement module in our proposed method. Experiments under the first class-split setting are reported in Table 9, with performance evaluated using mAP50. These results indicate that each enhancement module contributes differently to the overall performance. Specifically, under the 1-shot setting, the memory feature enhancement compensation mechanism, visual prompt enhancement mechanism, and multi-stage training strategy improve performance by 3.9, 3.1, and 1.4 mAP50, respectively. Under the 3-shot setting, the three modules yield gains of 4.9, 4.3, and 2.1 mAP50. For the 5-shot setting, the gains are 6.1, 6.3, and 3.8 mAP50, while for the 10-shot setting, the improvements reach 8.2, 7.8, and 5.6 mAP50, respectively. When all modules are applied together, the detector achieves even greater overall performance. These findings clearly validate the effectiveness of each proposed improvement module.

4.4.2. Impact of Freezing Different Network Modules

To investigate the effectiveness of different fine-tuning strategies and to justify the layer selection adopted in FS2-DETR for few-shot sonar object detection, we conduct a comprehensive ablation study by comparing multiple freezing schemes. The objective of this experiment is to identify which network components are most critical for adapting DETR-based detectors to few-shot sonar scenarios, and to assess whether the proposed strategy is overly dependent on the specific characteristics of the UATD dataset. All experiments are conducted under Split 1 and evaluated across four few-shot settings (1-shot, 3-shot, 5-shot, and 10-shot). For consistency, all models are trained using identical training configurations, and performance is reported in terms of mAP50. The detailed freezing strategies and their corresponding experimental results are summarized in Table 10. Among them, the fifth strategy corresponds to the one adopted in this work, while the fourth strategy represents its exact opposite in terms of trainable components. As can be observed from the results, the proposed strategy consistently achieves the best performance across different shot settings, whereas the fourth strategy yields the worst results, and the remaining strategies exhibit relatively inferior performance. These observations demonstrate the effectiveness of the proposed fine-tuning strategy and confirm that the selected trainable modules are empirically justified for few-shot sonar object detection.

4.4.3. Sensitivity Analysis of the Hyperparameter n

To further investigate the influence of visual prompts on the proposed FS2-DETR, we conduct a comprehensive sensitivity and robustness analysis with respect to both the number and composition of visual prompts. Specifically, the detector is evaluated under Split 1 across multiple few-shot settings using different prompt numbers n ∈ {3, 5, 10, 15}, and the performance obtained with only positive-class prompts is compared against that achieved with a combination of positive-class and negative-class prompts. For clarity and consistency with previous evaluations, only mAP50 is reported in this analysis. For the prompt number analysis, all training settings are kept unchanged except for the value of n, which is varied during training. To further assess the robustness of the method to randomness in prompt selection, multiple independent trials are conducted for each prompt number, where all prompts are randomly sampled from the training set in each run. The detailed experimental results are reported in Table 11 and Table 12.
From the results, it can be observed that incorporating negative-class visual prompts consistently yields slightly better performance than using only positive prompts. This indicates that negative prompts provide complementary background cues, which help the model suppress false positives and enhance discriminative representation learning. Moreover, as the number of visual prompts increases, the detection performance improves accordingly; however, the performance gains gradually saturate when n becomes large, suggesting that an appropriate number of prompts is sufficient to achieve effective feature enhancement. In addition, the performance variations across different n and few-shot settings remain within a limited range, demonstrating that the proposed method is not overly sensitive to specific prompt selections. These observations further confirm the robustness and generalization capability of FS2-DETR with respect to prompt configuration.

4.5. Computational Complexity and Efficiency Analysis

To further assess the efficiency of the proposed method in practical detection scenarios, we compare it with other DETR-based detectors in terms of computational and inference-related metrics, including GFLOPs, parameter count, and inference speed (FPS), as summarized in Table 13. Combined with the detection results reported in Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8, it can be observed that although our method incurs higher computational complexity, more parameters, and slightly slower inference speed than the baseline RT-DETR, it achieves substantially higher detection accuracy under few-shot settings. Compared with Meta-DETR, the proposed method maintains a comparable parameter scale while achieving lower computational complexity, faster inference speed, and superior detection performance. Relative to Hint-DETR, our method introduces slightly higher computational complexity but delivers improved detection accuracy, together with fewer parameters and faster inference speed. Overall, the proposed method provides a more favorable trade-off among detection accuracy, computational complexity, and inference efficiency, and its relatively compact parameter scale makes it more suitable for practical deployment in real-world underwater object detection tasks.
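Inference-speed figures of the kind reported in Table 13 can be estimated with a simple timed loop; the sketch below shows one way to do so, where the input resolution, warm-up count, run count, and device are assumptions rather than the paper’s measurement protocol.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 640, 640),
                n_warmup=10, n_runs=50, device="cpu"):
    """Estimate inference speed (FPS) and parameter count of a detector.

    Warm-up iterations are discarded so one-time costs (allocation,
    kernel selection) do not distort the timing.
    """
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(n_warmup):
        model(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    fps = n_runs / (time.perf_counter() - start)
    n_params = sum(p.numel() for p in model.parameters())
    return fps, n_params
```

On GPU, a call to `torch.cuda.synchronize()` before each timestamp would be needed for accurate timing, since CUDA kernels launch asynchronously.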

4.6. Visualization Analysis

We performed a qualitative analysis by visualizing the enhanced memory features across different epochs of detector training, focusing on the specific image regions that provide the enhanced memory information. Additionally, we highlighted the regions of the memory features that the visual prompts attend to most. Figure 7 and Figure 8 display the corresponding visual results.
As shown in Figure 7, the enhanced feature extraction regions within the memory feature maps progressively become more refined across training epochs, gradually converging toward the ground-truth object locations. As a result, the quality of the enhanced features improves consistently, which strengthens the information exchange between the memory feature maps and the object queries. This, in turn, enables the detector to learn more effectively.
As shown in Figure 8, the visual prompt focuses on the regions corresponding to objects of its associated class. This demonstrates its ability to highlight class-relevant features within the memory feature map. By facilitating more efficient interactions between these class-specific features and the object queries, the visual prompt improves the detector’s ability to differentiate among classes, ultimately boosting overall detection performance.

5. Conclusions

This work presents FS2-DETR, a DETR-based few-shot detection network, to address limited samples and weak or blurred small-object features in sonar images. Built upon RT-DETR and tailored to the characteristics of sonar imagery, the model introduces targeted optimizations in both network structure and training strategy. By incorporating a memory feature enhancement compensation mechanism, FS2-DETR leverages decoder outputs to reinforce key memory, effectively improving the detection of weak and small objects under limited-sample conditions. A visual prompt enhancement mechanism further augments object queries and memory features, enabling the model to fully extract and utilize salient information even with scarce training data. Additionally, a multi-stage training strategy mitigates confusion among few-shot classes, significantly enhancing classification accuracy and generalization performance. Experimental results on the enhanced UATD sonar image dataset demonstrate that FS2-DETR achieves excellent detection performance across various sample scales (k-shot = 1, 3, 5, 10), outperforming existing state-of-the-art detection algorithms and validating the effectiveness and robustness of the proposed approach for few-shot sonar image detection tasks.
Overall, FS2-DETR provides a feasible and efficient solution for few-shot sonar image detection, significantly improving the recognition of weak and small objects while demonstrating robust generalization under few-shot conditions. Due to the limited availability of public sonar datasets, all experiments were conducted on the representative UATD dataset. In future work, we will explore the model’s transferability across different scenarios and the integration of multimodal information—such as acoustic texture and temporal features—and will conduct further experiments to evaluate performance across a broader range of classes and scenarios once suitable datasets become available.

Author Contributions

Conceptualization, S.Y. and X.Z.; methodology, S.Y.; software, S.Y.; validation, S.Y. and X.Z.; formal analysis, S.Y. and P.T.; investigation, S.Y.; resources, X.Z.; data curation, S.Y.; writing—original draft preparation, S.Y.; writing—review and editing, X.Z.; visualization, S.Y.; supervision, X.Z.; project administration, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, J.; Li, Y.; Qin, H.; Dai, P.; Zhao, Z.; Hu, M. Sonar image generation by MFA-CycleGAN for boosting underwater object detection of AUVs. IEEE J. Ocean. Eng. 2024, 49, 905–919. [Google Scholar] [CrossRef]
  2. Shi, B.; Cao, T.; Ge, Q.; Lin, Y.; Wang, Z. Sonar image intelligent processing in seabed pipeline detection: Review and application. Meas. Sci. Technol. 2024, 35, 045405. [Google Scholar] [CrossRef]
  3. Xi, Z.; Zhao, J.; Zhu, W. Side-scan sonar image simulation considering imaging mechanism and marine environment for zero-shot shipwreck detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4209713. [Google Scholar] [CrossRef]
  4. Yang, Z.; Zhao, J.; Yu, Y.; Huang, C. A sample augmentation method for side-scan sonar full-class images that can be used for detection and segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5908111. [Google Scholar] [CrossRef]
  5. Shi, P.; He, Q.; Zhu, S.; Li, X.; Fan, X.; Xin, Y. Multi-scale fusion and efficient feature extraction for enhanced sonar image object detection. Expert Syst. Appl. 2024, 256, 124958. [Google Scholar] [CrossRef]
  6. Xi, J.; Ye, X.; Li, C. Sonar image target detection based on style transfer learning and random shape of noise under zero shot target. Remote Sens. 2022, 14, 6260. [Google Scholar] [CrossRef]
  7. Li, L.; Li, Y.; Wang, H.; Yue, C.; Gao, P.; Wang, Y.; Feng, X. Side-scan sonar image generation under zero and few samples for underwater target detection. Remote Sens. 2024, 16, 4134. [Google Scholar] [CrossRef]
  8. Hang, T.; Wu, W.; Feng, J.; Djigal, H.; Huang, J. A survey of Few-Shot Relation Extraction combining meta-learning with prompt learning. Neurocomputing 2025, 647, 130534. [Google Scholar] [CrossRef]
  9. Billion Polak, P.; Prusa, J.D.; Khoshgoftaar, T.M. Low-shot learning and class imbalance: A survey. J. Big Data 2024, 11, 1. [Google Scholar] [CrossRef]
  10. Köhler, M.; Eisenbach, M.; Gross, H.M. Few-shot object detection: A comprehensive survey. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 11958–11978. [Google Scholar] [CrossRef]
  11. Zhang, J.; Liu, L.; Silven, O.; Pietikäinen, M.; Hu, D. Few-shot class-incremental learning for classification and object detection: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2924–2945. [Google Scholar] [CrossRef]
  12. Han, G.; Huang, S.; Ma, J.; He, Y.; Chang, S.F. Meta faster r-cnn: Towards accurate few-shot object detection with attentive feature alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; Volume 36, pp. 780–789. [Google Scholar]
  13. Sun, B.; Li, B.; Cai, S.; Yuan, Y.; Zhang, C. Fsce: Few-shot object detection via contrastive proposal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7352–7362. [Google Scholar]
  14. Qiao, L.; Zhao, Y.; Li, Z.; Qiu, X.; Wu, J.; Zhang, C. Defrcn: Decoupled faster r-cnn for few-shot object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8681–8690. [Google Scholar]
  15. Yang, Z.; Guan, W.; Xiao, L.; Chen, H. Few-shot object detection in remote sensing images via data clearing and stationary meta-learning. Sensors 2024, 24, 3882. [Google Scholar] [CrossRef]
  16. Zhang, G.; Luo, Z.; Cui, K.; Lu, S.; Xing, E.P. Meta-DETR: Image-level few-shot detection with inter-class correlation exploitation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12832–12843. [Google Scholar] [CrossRef] [PubMed]
  17. Bulat, A.; Guerrero, R.; Martinez, B.; Tzimiropoulos, G. Fs-detr: Few-shot detection transformer with prompting and without re-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 11793–11802. [Google Scholar]
  18. Sivachandra, K.; Kumudham, R. A review: Object detection and classification using side scan sonar images via deep learning techniques. In Modern Approaches in Machine Learning and Cognitive Science: A Walkthrough: Volume 4; Springer: Cham, Switzerland, 2024; pp. 229–249. [Google Scholar]
  19. Aubard, M.; Madureira, A.; Teixeira, L.; Pinto, J. Sonar-Based Deep Learning in Underwater Robotics: Overview, Robustness, and Challenges. IEEE J. Ocean. Eng. 2025, 50, 1866–1884. [Google Scholar] [CrossRef]
  20. Jian, M.; Yang, N.; Tao, C.; Zhi, H.; Luo, H. Underwater object detection and datasets: A survey. Intell. Mar. Technol. Syst. 2024, 2, 9. [Google Scholar] [CrossRef]
  21. Zhang, H.; Tian, M.; Shao, G.; Cheng, J.; Liu, J. Target detection of forward-looking sonar image based on improved YOLOv5. IEEE Access 2022, 10, 18023–18034. [Google Scholar] [CrossRef]
  22. Wang, Z.; Guo, J.; Zeng, L.; Zhang, C.; Wang, B. MLFFNet: Multilevel feature fusion network for object detection in sonar images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5119119. [Google Scholar] [CrossRef]
  23. Zhao, Z.; Wang, Z.; Wang, B.; Guo, J. RMFENet: Refined multiscale feature enhancement network for arbitrary-oriented sonar object detection. IEEE Sens. J. 2023, 23, 29211–29226. [Google Scholar] [CrossRef]
  24. Palomeras, N.; Furfaro, T.; Williams, D.P.; Carreras, M.; Dugelay, S. Automatic target recognition for mine countermeasure missions using forward-looking sonar data. IEEE J. Ocean. Eng. 2021, 47, 141–161. [Google Scholar] [CrossRef]
  25. Ghavidel, M.; Azhdari, S.M.H.; Khishe, M.; Kazemirad, M. Sonar data classification by using few-shot learning and concept extraction. Appl. Acoust. 2022, 195, 108856. [Google Scholar] [CrossRef]
  26. Preciado-Grijalva, A.; Wehbe, B.; Firvida, M.B.; Valdenegro-Toro, M. Self-supervised learning for sonar image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1499–1508. [Google Scholar]
  27. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  28. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  29. Shehzadi, T.; Hashmi, K.A.; Liwicki, M.; Stricker, D.; Afzal, M.Z. Object detection with transformers: A review. Sensors 2025, 25, 6025. [Google Scholar] [CrossRef]
  30. Chen, Q.; Chen, X.; Wang, J.; Zhang, S.; Yao, K.; Feng, H.; Han, J.; Ding, E.; Zeng, G.; Wang, J. Group detr: Fast detr training with group-wise one-to-many assignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6633–6642. [Google Scholar]
  31. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  32. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3651–3660. [Google Scholar]
  33. Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor detr: Query design for transformer-based detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; Volume 36, pp. 2567–2575. [Google Scholar]
  34. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv 2022, arXiv:2201.12329. [Google Scholar] [CrossRef]
  35. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
  36. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  37. Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient detr: Improving end-to-end object detector with dense prior. arXiv 2021, arXiv:2104.01318. [Google Scholar]
  38. Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse detr: Efficient end-to-end object detection with learnable sparsity. arXiv 2021, arXiv:2111.14330. [Google Scholar]
  39. Song, Y.; Wang, T.; Cai, P.; Mondal, S.K.; Sahoo, J.P. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Comput. Surv. 2023, 55, 1–40. [Google Scholar] [CrossRef]
  40. Gharoun, H.; Momenifar, F.; Chen, F.; Gandomi, A.H. Meta-learning approaches for few-shot learning: A survey of recent advances. ACM Comput. Surv. 2024, 56, 1–41. [Google Scholar] [CrossRef]
  41. Xin, Z.; Chen, S.; Wu, T.; Shao, Y.; Ding, W.; You, X. Few-shot object detection: Research advances and challenges. Inf. Fusion 2024, 107, 102307. [Google Scholar] [CrossRef]
  42. Madan, A.; Peri, N.; Kong, S.; Ramanan, D. Revisiting few-shot object detection with vision-language models. Adv. Neural Inf. Process. Syst. 2024, 37, 19547–19560. [Google Scholar]
  43. Wang, X.; Huang, T.E.; Darrell, T.; Gonzalez, J.E.; Yu, F. Frustratingly simple few-shot object detection. arXiv 2020, arXiv:2003.06957. [Google Scholar] [CrossRef]
  44. Gidaris, S.; Komodakis, N. Generating classification weights with gnn denoising autoencoders for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 21–30. [Google Scholar]
  45. Li, A.; Li, Z. Transformation invariant few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3094–3102. [Google Scholar]
  46. Chen, H.; Wang, Y.; Wang, G.; Qiao, Y. Lstd: A low-shot transfer detector for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  47. Wu, J.; Liu, S.; Huang, D.; Wang, Y. Multi-scale positive sample refinement for few-shot object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 456–472. [Google Scholar]
  48. Fan, Z.; Ma, Y.; Li, Z.; Sun, J. Generalized few-shot object detection without forgetting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4527–4536. [Google Scholar]
  49. Li, B.; Wang, C.; Reddy, P.; Kim, S.; Scherer, S. Airdet: Few-shot detection without fine-tuning for autonomous exploration. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 427–444. [Google Scholar]
  50. Dong, N.; Zhang, Y.; Ding, M.; Lee, G.H. Incremental-detr: Incremental few-shot object detection via self-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 543–551. [Google Scholar]
  51. Xie, K.; Yang, J.; Qiu, K. A dataset with multibeam forward-looking sonar for underwater object detection. Sci. Data 2022, 9, 739. [Google Scholar] [CrossRef] [PubMed]
  52. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 9912–9924. [Google Scholar]
  53. Metaxas, I.M.; Bulat, A.; Patras, I.; Martinez, B.; Tzimiropoulos, G. Aligned Unsupervised Pretraining of Object Detectors with Self-training. arXiv 2023, arXiv:2307.15697. [Google Scholar]
  54. Han, G.; Ma, J.; Huang, S.; Chen, L.; Chang, S.F. Few-shot object detection with fully cross-transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5321–5330. [Google Scholar]
  55. Liu, Y.; Zhang, G.; Li, X. Hint-DETR: A Transfer Learning Model Based On DETR For Few-Shot Defect Detection. IEEE Trans. Instrum. Meas. 2025, 74, 5033711. [Google Scholar] [CrossRef]
Figure 1. Overview of the classical DETR architecture, including the CNN backbone, transformer encoder–decoder, and bipartite matching-based prediction mechanism.
Figure 2. Overall architecture of the proposed FS2-DETR framework.
Figure 3. Schematic illustration of the decoder prediction–guided feature enhancement and compensation in FS2-DETR.
Figure 4. Structural details of the DPGFRM in FS2-DETR.
Figure 5. Example images from the UATD dataset with ground-truth annotations.
Figure 6. Qualitative comparison of detection results produced by different detectors on the UATD dataset.
Figure 7. Visualization of enhanced feature sampling positions projected onto the original image. The sampling positions are extracted from the multi-scale feature maps produced by the backbone at different decoder stages during training and are mapped back to the input image for visualization. The left three columns show the projected resampling regions of enhanced features at different epochs, while the rightmost column presents the corresponding ground-truth annotations.
Figure 8. Visualization of regions attended by visual prompts during the cross-attention interaction with memory features. The bounding boxes indicate the ground-truth annotations.
Table 1. Three different base/novel class splits for evaluating FS2-DETR on the UATD dataset.
| Configuration | Base Classes | Novel Classes |
|---|---|---|
| 1 | cube, ball, human body, distressed airplanes, square cage, metal bucket, tyre | cylinder, bluerov, circular cage |
| 2 | cube, cylinder, human body, circular cage, square cage, tyre, bluerov | ball, metal bucket, distressed airplanes |
| 3 | cylinder, distressed airplanes, circular cage, square cage, metal bucket, bluerov | cube, tyre, human body |
Table 2. Training configurations for the two-stage few-shot learning of FS2-DETR.
| Training Stage | Pre-Training Stage | Fine-Tuning Stage |
|---|---|---|
| Optimized Modules | All | Projection, Classifier, Query Projection, DPGFRM |
| Optimizer | AdamW | AdamW |
| Epochs | 50 | 24 |
| Batch Size | 4 | 2 |
| Initial Learning Rate | 2 × 10⁻⁴ | 1 × 10⁻⁴ |
| Weight Decay | 1 × 10⁻⁵ | 1 × 10⁻⁴ |
| Hyperparameter n | 5 | 5 |
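The fine-tuning stage in Table 2 updates only the class-specific layers (Projection, Classifier, Query Projection, DPGFRM) while the rest of the network stays frozen. A minimal sketch of this stage-dependent parameter selection is given below; the module names are taken from Table 2, but the dotted parameter-name convention is an assumption borrowed from common deep-learning frameworks, not confirmed by the paper.

```python
# Stage-dependent parameter selection, following Table 2.
# Module names come from the table; the "module.sub.param" naming
# scheme is an illustrative assumption.
FINE_TUNE_MODULES = {"projection", "classifier", "query_projection", "dpgfrm"}

def trainable(param_name: str, stage: str) -> bool:
    """Decide whether a parameter is optimized in the given training stage."""
    if stage == "pretrain":   # pre-training stage: all parameters updated
        return True
    if stage == "finetune":   # fine-tuning stage: only class-specific layers
        module = param_name.split(".", 1)[0]
        return module in FINE_TUNE_MODULES
    raise ValueError(f"unknown stage: {stage}")

params = ["backbone.conv1.weight", "classifier.weight", "dpgfrm.offset.bias"]
print([p for p in params if trainable(p, "finetune")])
# -> ['classifier.weight', 'dpgfrm.offset.bias']
```

In a real training loop, this predicate would drive which parameters are passed to the AdamW optimizer (or have gradients disabled) in each stage.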
Table 3. Performance comparison of FS2-DETR and other methods in terms of mAP 50 under Split 1 on the UATD dataset.
| Methods | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline [27] | 27.6 ± 0.5 | 28.4 ± 0.2 | 29.6 ± 0.4 | 31.4 ± 0.2 |
| B-FSDet [15] | 25.3 ± 0.4 | 27.6 ± 0.4 | 28.4 ± 0.2 | 30.2 ± 0.3 |
| DeFRCN [14] | 28.8 ± 0.3 | 30.6 ± 0.3 | 32.4 ± 0.2 | 34.7 ± 0.4 |
| Meta Faster R-CNN [12] | 30.5 ± 0.2 | 33.2 ± 0.6 | 36.9 ± 0.3 | 39.8 ± 0.1 |
| FSCE [13] | 27.5 ± 0.5 | 29.4 ± 0.3 | 32.5 ± 0.4 | 33.7 ± 0.4 |
| TIP [45] | 25.8 ± 0.2 | 27.2 ± 0.5 | 28.3 ± 0.3 | 29.7 ± 0.4 |
| AirDet [49] | 31.5 ± 0.2 | 34.3 ± 0.4 | 36.8 ± 0.2 | 39.0 ± 0.3 |
| FCT [54] | 28.4 ± 0.4 | 31.6 ± 0.5 | 33.0 ± 0.2 | 35.9 ± 0.5 |
| Meta-DETR [16] | 32.5 ± 0.5 | 35.8 ± 0.1 | 37.5 ± 0.3 | 40.7 ± 0.2 |
| Hint-DETR [55] | 34.5 ± 0.2 | 37.4 ± 0.4 | 39.7 ± 0.3 | 41.3 ± 0.3 |
| Ours | 35.2 ± 0.4 | 38.7 ± 0.4 | 40.3 ± 0.2 | 43.3 ± 0.2 |
Table 4. Performance comparison of FS2-DETR and other methods in terms of mAP 50 under Split 2 on the UATD dataset.
| Methods | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline [27] | 28.1 ± 0.8 | 29.0 ± 0.4 | 30.4 ± 0.3 | 31.6 ± 0.3 |
| B-FSDet [15] | 26.6 ± 0.6 | 28.5 ± 0.2 | 29.3 ± 0.5 | 31.4 ± 0.2 |
| DeFRCN [14] | 28.4 ± 0.3 | 29.7 ± 0.5 | 32.4 ± 0.4 | 35.9 ± 0.2 |
| Meta Faster R-CNN [12] | 33.5 ± 0.4 | 35.9 ± 0.5 | 38.1 ± 0.2 | 42.1 ± 0.1 |
| FSCE [13] | 28.3 ± 0.5 | 29.6 ± 0.4 | 31.4 ± 0.3 | 34.5 ± 0.4 |
| TIP [45] | 26.4 ± 0.2 | 28.2 ± 0.3 | 29.5 ± 0.6 | 32.3 ± 0.3 |
| AirDet [49] | 34.4 ± 0.3 | 36.7 ± 0.2 | 38.5 ± 0.2 | 42.3 ± 0.3 |
| FCT [54] | 30.3 ± 0.2 | 33.4 ± 0.5 | 35.6 ± 0.4 | 39.9 ± 0.5 |
| Meta-DETR [16] | 36.4 ± 0.2 | 38.6 ± 0.1 | 41.5 ± 0.7 | 44.1 ± 0.2 |
| Hint-DETR [55] | 37.5 ± 0.5 | 40.3 ± 0.4 | 42.3 ± 0.2 | 44.5 ± 0.3 |
| Ours | 38.9 ± 0.3 | 41.2 ± 0.5 | 42.2 ± 0.5 | 45.0 ± 0.2 |
Table 5. Performance comparison of FS2-DETR and other methods in terms of mAP 50 under Split 3 on the UATD dataset.
| Methods | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline [27] | 25.3 ± 0.5 | 26.4 ± 0.3 | 28.5 ± 0.2 | 30.1 ± 0.3 |
| B-FSDet [15] | 23.7 ± 0.2 | 25.4 ± 0.1 | 27.5 ± 0.5 | 29.7 ± 0.5 |
| DeFRCN [14] | 26.7 ± 0.5 | 27.5 ± 0.2 | 30.4 ± 0.4 | 33.2 ± 0.2 |
| Meta Faster R-CNN [12] | 30.2 ± 0.6 | 31.4 ± 0.1 | 34.3 ± 0.2 | 39.4 ± 0.3 |
| FSCE [13] | 26.5 ± 0.4 | 29.2 ± 0.5 | 30.3 ± 0.3 | 32.6 ± 0.6 |
| TIP [45] | 24.5 ± 0.3 | 26.9 ± 0.7 | 28.5 ± 0.1 | 31.4 ± 0.4 |
| AirDet [49] | 32.3 ± 0.2 | 33.4 ± 0.5 | 35.3 ± 0.4 | 39.5 ± 0.2 |
| FCT [54] | 27.3 ± 0.4 | 28.8 ± 0.7 | 31.4 ± 0.2 | 34.7 ± 0.5 |
| Meta-DETR [16] | 32.4 ± 0.4 | 33.3 ± 0.3 | 35.4 ± 0.5 | 41.6 ± 0.2 |
| Hint-DETR [55] | 32.3 ± 0.2 | 34.4 ± 0.6 | 38.2 ± 0.4 | 42.5 ± 0.1 |
| Ours | 33.2 ± 0.2 | 36.4 ± 0.1 | 39.6 ± 0.2 | 42.9 ± 0.2 |
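The mAP50 metric reported in Tables 3–5 counts a predicted box as a true positive only when it overlaps a same-class ground-truth box with an intersection-over-union (IoU) of at least 0.5. As a reference for readers unfamiliar with the threshold, here is a minimal, self-contained IoU sketch (the `(x1, y1, x2, y2)` box convention is an assumption; the paper does not specify its internal box format):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes shifted by half their width overlap with IoU = 50/150 ≈ 0.33,
# which would NOT count as a match under the mAP50 criterion.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

mAP50 then averages the per-class average precision computed from these matches across all novel classes.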
Table 6. Performance comparison of FS2-DETR and other methods in terms of mAP under Split 1 on the UATD dataset.
| Methods | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline [27] | 9.6 ± 0.4 | 10.2 ± 0.3 | 11.8 ± 0.3 | 13.7 ± 0.5 |
| B-FSDet [15] | 9.5 ± 0.3 | 10.4 ± 0.5 | 12.1 ± 0.3 | 14.7 ± 0.6 |
| DeFRCN [14] | 10.6 ± 0.6 | 12.3 ± 0.4 | 14.5 ± 0.5 | 16.8 ± 0.2 |
| Meta Faster R-CNN [12] | 13.5 ± 0.3 | 14.1 ± 0.5 | 16.8 ± 0.7 | 18.9 ± 0.2 |
| FSCE [13] | 11.4 ± 0.4 | 13.0 ± 0.2 | 14.2 ± 0.4 | 15.1 ± 0.3 |
| TIP [45] | 10.9 ± 0.4 | 11.5 ± 0.5 | 13.7 ± 0.4 | 14.7 ± 0.2 |
| AirDet [49] | 13.7 ± 0.5 | 14.5 ± 0.3 | 16.0 ± 0.5 | 18.2 ± 0.1 |
| FCT [54] | 11.6 ± 0.2 | 12.3 ± 0.5 | 14.6 ± 0.4 | 16.5 ± 0.4 |
| Meta-DETR [16] | 14.2 ± 0.4 | 15.8 ± 0.5 | 17.6 ± 0.3 | 19.3 ± 0.1 |
| Hint-DETR [55] | 15.5 ± 0.5 | 16.7 ± 0.2 | 18.3 ± 0.2 | 19.6 ± 0.4 |
| Ours | 15.4 ± 0.2 | 17.6 ± 0.3 | 19.3 ± 0.2 | 21.0 ± 0.5 |
Table 7. Performance comparison of FS2-DETR and other methods in terms of mAP under Split 2 on the UATD dataset.

| Methods | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline [27] | 10.4 ± 0.3 | 11.7 ± 0.5 | 12.9 ± 0.6 | 15.0 ± 0.2 |
| B-FSDet [15] | 10.2 ± 0.2 | 11.3 ± 0.3 | 13.6 ± 0.2 | 16.2 ± 0.5 |
| DeFRCN [14] | 11.4 ± 0.5 | 13.2 ± 0.2 | 15.9 ± 0.5 | 18.6 ± 0.4 |
| Meta Faster R-CNN [12] | 15.0 ± 0.5 | 16.2 ± 0.1 | 18.3 ± 0.2 | 20.1 ± 0.3 |
| FSCE [13] | 12.5 ± 0.4 | 14.2 ± 0.5 | 15.7 ± 0.2 | 16.9 ± 0.3 |
| TIP [45] | 11.3 ± 0.2 | 12.5 ± 0.4 | 14.7 ± 0.6 | 16.3 ± 0.5 |
| AirDet [49] | 14.6 ± 0.5 | 16.0 ± 0.4 | 17.2 ± 0.4 | 19.6 ± 0.1 |
| FCT [54] | 12.0 ± 0.3 | 13.2 ± 0.3 | 15.0 ± 0.6 | 17.3 ± 0.5 |
| Meta-DETR [16] | 15.3 ± 0.4 | 16.5 ± 0.3 | 18.8 ± 0.5 | 20.4 ± 0.2 |
| Hint-DETR [55] | 15.5 ± 0.5 | 17.0 ± 0.3 | 19.2 ± 0.6 | 21.0 ± 0.2 |
| Ours | 16.9 ± 0.1 | 18.3 ± 0.4 | 19.9 ± 0.2 | 21.8 ± 0.5 |
Table 8. Performance comparison of FS2-DETR and other methods in terms of mAP under Split 3 on the UATD dataset.

| Methods | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline [27] | 9.1 ± 0.3 | 9.9 ± 0.4 | 10.7 ± 0.4 | 12.7 ± 0.3 |
| B-FSDet [15] | 8.7 ± 0.3 | 9.9 ± 0.5 | 11.5 ± 0.6 | 13.8 ± 0.2 |
| DeFRCN [14] | 9.4 ± 0.5 | 10.7 ± 0.2 | 11.9 ± 0.5 | 14.5 ± 0.1 |
| Meta Faster R-CNN [12] | 12.8 ± 0.1 | 13.8 ± 0.4 | 15.2 ± 0.5 | 17.9 ± 0.3 |
| FSCE [13] | 10.5 ± 0.7 | 11.6 ± 0.4 | 13.0 ± 0.4 | 14.4 ± 0.5 |
| TIP [45] | 9.6 ± 0.2 | 10.5 ± 0.2 | 12.3 ± 0.4 | 13.9 ± 0.3 |
| AirDet [49] | 12.5 ± 0.3 | 13.9 ± 0.3 | 15.3 ± 0.4 | 18.1 ± 0.2 |
| FCT [54] | 10.1 ± 0.5 | 11.3 ± 0.2 | 13.2 ± 0.2 | 15.8 ± 0.3 |
| Meta-DETR [16] | 13.6 ± 0.2 | 14.5 ± 0.6 | 16.3 ± 0.3 | 18.9 ± 0.6 |
| Hint-DETR [55] | 14.2 ± 0.6 | 15.6 ± 0.1 | 17.0 ± 0.4 | 18.5 ± 0.3 |
| Ours | 14.1 ± 0.5 | 15.4 ± 0.6 | 17.2 ± 0.1 | 19.3 ± 0.3 |
Table 9. Ablation study results showing the performance impact of each proposed module on FS2-DETR, measured in terms of mAP 50 .
| Strategies | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline | 27.8 | 28.5 | 29.9 | 31.3 |
| Baseline + Memory Feature Enhancement Compensation Mechanism | 31.7 | 33.4 | 36.0 | 39.5 |
| Baseline + Visual Prompt Enhancement Mechanism | 30.9 | 32.8 | 36.2 | 39.1 |
| Baseline + Multi-Stage Training Strategy | 29.2 | 30.6 | 33.7 | 36.9 |
| Baseline + All | 35.3 | 38.6 | 40.5 | 43.2 |
Table 10. Ablation results of different layer-freezing schemes on FS2-DETR under Split 1 ( mAP 50 ).
Table 10. Ablation results of different layer-freezing schemes on FS2-DETR under Split 1 ( mAP 50 ).
Freezen Moudles1-Shot3-Shot5-Shot10-Shot
No Freezing32.634.337.841.5
Backbone33.135.738.842.1
Backbone+Encoder+PAN-Fusion+Decoder34.937.639.242.7
Projection+Query Projection+Classifier+DPGFRM25.626.228.530.1
Ours35.338.640.543.2
Table 11. Effect of the number of visual prompts when using both positive and negative prompts (mAP50).

| Number of Prompts (n) | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| 3 | 34.7 ± 0.4 | 36.2 ± 0.2 | 39.1 ± 0.3 | 41.6 ± 0.5 |
| 5 | 35.2 ± 0.4 | 38.7 ± 0.4 | 40.3 ± 0.2 | 43.3 ± 0.2 |
| 10 | 36.0 ± 0.5 | 39.2 ± 0.5 | 40.5 ± 0.3 | 43.6 ± 0.4 |
| 15 | 36.4 ± 0.2 | 39.0 ± 0.6 | 40.7 ± 0.3 | 43.5 ± 0.2 |
Table 12. Effect of the number of visual prompts when using positive prompts only (mAP50).

| Number of Prompts (n) | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| 3 | 33.8 ± 0.3 | 35.2 ± 0.4 | 38.9 ± 0.2 | 40.1 ± 0.5 |
| 5 | 34.3 ± 0.5 | 37.0 ± 0.6 | 39.5 ± 0.2 | 42.4 ± 0.4 |
| 10 | 35.1 ± 0.3 | 38.0 ± 0.4 | 39.6 ± 0.6 | 42.4 ± 0.2 |
| 15 | 35.3 ± 0.3 | 37.2 ± 0.5 | 39.5 ± 0.3 | 42.2 ± 0.4 |
Table 13. Comparison of computational complexity and inference speed between FS2-DETR and other transformer-based detectors.

| Methods | GFLOPs | Params (M) | FPS |
|---|---|---|---|
| RT-DETR | 64 | 32 | 63 |
| Hint-DETR | 89 | 88 | 24 |
| Meta-DETR | 76 | 352 | 16 |
| FS2-DETR (Ours) | 102 | 50 | 43 |
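The FPS figures in Table 13 measure end-to-end inference throughput. The paper does not describe its timing protocol, but a common approach is to warm up the model and then average over many forward passes; the sketch below shows such a generic harness, with a trivial stand-in for the detector's forward pass.

```python
import time

def measure_fps(infer, n_warmup=10, n_iters=100):
    """Average frames per second of a single-image inference callable."""
    for _ in range(n_warmup):          # warm-up: exclude one-time setup costs
        infer()
    start = time.perf_counter()
    for _ in range(n_iters):
        infer()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed

# Stand-in workload; a real measurement would call the detector on a
# preloaded sonar image (and synchronize the GPU before reading the clock).
dummy_infer = lambda: sum(i * i for i in range(10_000))
print(f"{measure_fps(dummy_infer):.1f} FPS")
```

When timing GPU models, the clock should only be read after device synchronization; otherwise asynchronous kernel launches make the loop appear faster than it is.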
Yang, S.; Zhang, X.; Tan, P. FS2-DETR: Transformer-Based Few-Shot Sonar Object Detection with Enhanced Feature Perception. J. Mar. Sci. Eng. 2026, 14, 304. https://doi.org/10.3390/jmse14030304