FS2-DETR: Transformer-Based Few-Shot Sonar Object Detection with Enhanced Feature Perception
Abstract
1. Introduction
- We introduce FS2-DETR, a DETR-based framework for few-shot object detection in sonar images, which explicitly addresses the dual challenges of data scarcity and degraded object features. By integrating tailored architectural designs with optimized training strategies, the proposed framework achieves more stable and generalizable detection performance under extremely limited supervision.
- A feature enhancement compensation mechanism is proposed to reinforce key feature learning. Through a dedicated DPGFRM, the decoder's predictions are fed back to augment the encoder memory. This design captures latent semantic cues from limited samples and alleviates the insufficient feature representation of few-shot classes (a minimal sketch of this mechanism follows this list).
- A visual prompt enhancement mechanism is proposed to improve feature interaction and representation learning. Optimized visual prompts jointly strengthen the object queries and the encoder memory, highlighting salient regions and key features in sonar images. This makes it easier to capture semantic distinctions among few-shot objects and sharpens the detector's sensitivity and inter-class discrimination (see the second sketch after this list).
- A multi-stage training strategy is proposed to address class confusion in few-shot object detection. By progressively reinforcing class recognition across successive training phases, this approach effectively reduces misclassifications and enhances the model’s robustness under few-shot conditions.
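To make the first mechanism concrete, below is a minimal PyTorch sketch of a decoder-prediction-guided refinement module in the spirit of the DPGFRM described above. Everything beyond the module's name is an illustrative assumption: the cross-attention fusion, the learnable weight `alpha`, and the tensor shapes are ours, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DPGFRM(nn.Module):
    """Hypothetical sketch of a decoder-prediction-guided feature refinement
    module: decoder outputs are fed back to compensate the encoder memory."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Memory tokens attend to the decoder output embeddings.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.alpha = nn.Parameter(torch.tensor(0.1))  # learnable fusion weight (assumed)

    def forward(self, memory: torch.Tensor, dec_out: torch.Tensor) -> torch.Tensor:
        # memory:  (B, HW, C) flattened encoder features
        # dec_out: (B, Q, C)  decoder output embeddings behind the predictions
        comp, _ = self.cross_attn(query=memory, key=dec_out, value=dec_out)
        return self.norm(memory + self.alpha * comp)  # compensated memory

# Usage: refine the memory with a first decoding pass, then decode again.
B, HW, Q, C = 2, 600, 100, 256
memory, dec_out = torch.randn(B, HW, C), torch.randn(B, Q, C)
memory = DPGFRM()(memory, dec_out)
print(memory.shape)  # torch.Size([2, 600, 256])
```

The point of the feedback loop is that decoder outputs already encode class and box hypotheses, so attending the memory to them re-emphasizes exemplar-relevant regions before a second decoding pass.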
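Likewise, a hedged sketch of the second mechanism: n learnable visual prompts (n = 5 by default, per the hyperparameter study in Section 4.4.3) that both the encoder memory and the object queries attend to. The module name and the additive fusion are our assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class VisualPromptEnhancer(nn.Module):
    """Hypothetical sketch: n learnable visual prompts jointly strengthen
    the object queries and the encoder memory."""

    def __init__(self, n_prompts: int = 5, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)
        self.attn_mem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_qry = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, memory: torch.Tensor, queries: torch.Tensor):
        # memory: (B, HW, C); queries: (B, Q, C)
        p = self.prompts.unsqueeze(0).expand(memory.size(0), -1, -1)  # (B, n, C)
        mem_enh, _ = self.attn_mem(memory, p, p)   # memory attends to prompts
        qry_enh, _ = self.attn_qry(queries, p, p)  # queries attend to prompts
        return memory + mem_enh, queries + qry_enh

enh = VisualPromptEnhancer()
m, q = enh(torch.randn(2, 600, 256), torch.randn(2, 100, 256))
```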
2. Related Work
2.1. DETR
2.2. Few-Shot Object Detection
3. Method
3.1. Preliminary
3.2. Memory Feature Enhancement Compensation Mechanism
3.3. Visual Prompt Enhancement Mechanism
3.4. Multi-Stage Training Strategy
4. Experiments
4.1. Dataset
4.2. Implementation Details
4.3. Analysis of Experimental Results
4.4. Ablation Studies
4.4.1. Ablation on the Proposed Enhancement Modules
4.4.2. Impact of Freezing Different Network Modules
4.4.3. Sensitivity Analysis of the Hyperparameter n
4.5. Computational Complexity and Efficiency Analysis
4.6. Visualization Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zhou, J.; Li, Y.; Qin, H.; Dai, P.; Zhao, Z.; Hu, M. Sonar image generation by MFA-CycleGAN for boosting underwater object detection of AUVs. IEEE J. Ocean. Eng. 2024, 49, 905–919.
- Shi, B.; Cao, T.; Ge, Q.; Lin, Y.; Wang, Z. Sonar image intelligent processing in seabed pipeline detection: Review and application. Meas. Sci. Technol. 2024, 35, 045405.
- Xi, Z.; Zhao, J.; Zhu, W. Side-scan sonar image simulation considering imaging mechanism and marine environment for zero-shot shipwreck detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4209713.
- Yang, Z.; Zhao, J.; Yu, Y.; Huang, C. A sample augmentation method for side-scan sonar full-class images that can be used for detection and segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5908111.
- Shi, P.; He, Q.; Zhu, S.; Li, X.; Fan, X.; Xin, Y. Multi-scale fusion and efficient feature extraction for enhanced sonar image object detection. Expert Syst. Appl. 2024, 256, 124958.
- Xi, J.; Ye, X.; Li, C. Sonar image target detection based on style transfer learning and random shape of noise under zero shot target. Remote Sens. 2022, 14, 6260.
- Li, L.; Li, Y.; Wang, H.; Yue, C.; Gao, P.; Wang, Y.; Feng, X. Side-scan sonar image generation under zero and few samples for underwater target detection. Remote Sens. 2024, 16, 4134.
- Hang, T.; Wu, W.; Feng, J.; Djigal, H.; Huang, J. A survey of few-shot relation extraction combining meta-learning with prompt learning. Neurocomputing 2025, 647, 130534.
- Billion Polak, P.; Prusa, J.D.; Khoshgoftaar, T.M. Low-shot learning and class imbalance: A survey. J. Big Data 2024, 11, 1.
- Köhler, M.; Eisenbach, M.; Gross, H.M. Few-shot object detection: A comprehensive survey. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 11958–11978.
- Zhang, J.; Liu, L.; Silven, O.; Pietikäinen, M.; Hu, D. Few-shot class-incremental learning for classification and object detection: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2924–2945.
- Han, G.; Huang, S.; Ma, J.; He, Y.; Chang, S.F. Meta Faster R-CNN: Towards accurate few-shot object detection with attentive feature alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; Volume 36, pp. 780–789.
- Sun, B.; Li, B.; Cai, S.; Yuan, Y.; Zhang, C. FSCE: Few-shot object detection via contrastive proposal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7352–7362.
- Qiao, L.; Zhao, Y.; Li, Z.; Qiu, X.; Wu, J.; Zhang, C. DeFRCN: Decoupled Faster R-CNN for few-shot object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 8681–8690.
- Yang, Z.; Guan, W.; Xiao, L.; Chen, H. Few-shot object detection in remote sensing images via data clearing and stationary meta-learning. Sensors 2024, 24, 3882.
- Zhang, G.; Luo, Z.; Cui, K.; Lu, S.; Xing, E.P. Meta-DETR: Image-level few-shot detection with inter-class correlation exploitation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12832–12843.
- Bulat, A.; Guerrero, R.; Martinez, B.; Tzimiropoulos, G. FS-DETR: Few-shot detection transformer with prompting and without re-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 11793–11802.
- Sivachandra, K.; Kumudham, R. A review: Object detection and classification using side scan sonar images via deep learning techniques. In Modern Approaches in Machine Learning and Cognitive Science: A Walkthrough, Volume 4; Springer: Cham, Switzerland, 2024; pp. 229–249.
- Aubard, M.; Madureira, A.; Teixeira, L.; Pinto, J. Sonar-based deep learning in underwater robotics: Overview, robustness, and challenges. IEEE J. Ocean. Eng. 2025, 50, 1866–1884.
- Jian, M.; Yang, N.; Tao, C.; Zhi, H.; Luo, H. Underwater object detection and datasets: A survey. Intell. Mar. Technol. Syst. 2024, 2, 9.
- Zhang, H.; Tian, M.; Shao, G.; Cheng, J.; Liu, J. Target detection of forward-looking sonar image based on improved YOLOv5. IEEE Access 2022, 10, 18023–18034.
- Wang, Z.; Guo, J.; Zeng, L.; Zhang, C.; Wang, B. MLFFNet: Multilevel feature fusion network for object detection in sonar images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5119119.
- Zhao, Z.; Wang, Z.; Wang, B.; Guo, J. RMFENet: Refined multiscale feature enhancement network for arbitrary-oriented sonar object detection. IEEE Sens. J. 2023, 23, 29211–29226.
- Palomeras, N.; Furfaro, T.; Williams, D.P.; Carreras, M.; Dugelay, S. Automatic target recognition for mine countermeasure missions using forward-looking sonar data. IEEE J. Ocean. Eng. 2021, 47, 141–161.
- Ghavidel, M.; Azhdari, S.M.H.; Khishe, M.; Kazemirad, M. Sonar data classification by using few-shot learning and concept extraction. Appl. Acoust. 2022, 195, 108856.
- Preciado-Grijalva, A.; Wehbe, B.; Firvida, M.B.; Valdenegro-Toro, M. Self-supervised learning for sonar image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1499–1508.
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
- Shehzadi, T.; Hashmi, K.A.; Liwicki, M.; Stricker, D.; Afzal, M.Z. Object detection with transformers: A review. Sensors 2025, 25, 6025.
- Chen, Q.; Chen, X.; Wang, J.; Zhang, S.; Yao, K.; Feng, H.; Han, J.; Ding, E.; Zeng, G.; Wang, J. Group DETR: Fast DETR training with group-wise one-to-many assignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6633–6642.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3651–3660.
- Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor DETR: Query design for transformer-based detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; Volume 36, pp. 2567–2575.
- Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv 2022, arXiv:2201.12329.
- Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627.
- Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605.
- Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient DETR: Improving end-to-end object detector with dense prior. arXiv 2021, arXiv:2104.01318.
- Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse DETR: Efficient end-to-end object detection with learnable sparsity. arXiv 2021, arXiv:2111.14330.
- Song, Y.; Wang, T.; Cai, P.; Mondal, S.K.; Sahoo, J.P. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Comput. Surv. 2023, 55, 1–40.
- Gharoun, H.; Momenifar, F.; Chen, F.; Gandomi, A.H. Meta-learning approaches for few-shot learning: A survey of recent advances. ACM Comput. Surv. 2024, 56, 1–41.
- Xin, Z.; Chen, S.; Wu, T.; Shao, Y.; Ding, W.; You, X. Few-shot object detection: Research advances and challenges. Inf. Fusion 2024, 107, 102307.
- Madan, A.; Peri, N.; Kong, S.; Ramanan, D. Revisiting few-shot object detection with vision-language models. Adv. Neural Inf. Process. Syst. 2024, 37, 19547–19560.
- Wang, X.; Huang, T.E.; Darrell, T.; Gonzalez, J.E.; Yu, F. Frustratingly simple few-shot object detection. arXiv 2020, arXiv:2003.06957.
- Gidaris, S.; Komodakis, N. Generating classification weights with GNN denoising autoencoders for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 21–30.
- Li, A.; Li, Z. Transformation invariant few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3094–3102.
- Chen, H.; Wang, Y.; Wang, G.; Qiao, Y. LSTD: A low-shot transfer detector for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Wu, J.; Liu, S.; Huang, D.; Wang, Y. Multi-scale positive sample refinement for few-shot object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 456–472.
- Fan, Z.; Ma, Y.; Li, Z.; Sun, J. Generalized few-shot object detection without forgetting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4527–4536.
- Li, B.; Wang, C.; Reddy, P.; Kim, S.; Scherer, S. AirDet: Few-shot detection without fine-tuning for autonomous exploration. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 427–444.
- Dong, N.; Zhang, Y.; Ding, M.; Lee, G.H. Incremental-DETR: Incremental few-shot object detection via self-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 543–551.
- Xie, K.; Yang, J.; Qiu, K. A dataset with multibeam forward-looking sonar for underwater object detection. Sci. Data 2022, 9, 739.
- Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 9912–9924.
- Metaxas, I.M.; Bulat, A.; Patras, I.; Martinez, B.; Tzimiropoulos, G. Aligned unsupervised pretraining of object detectors with self-training. arXiv 2023, arXiv:2307.15697.
- Han, G.; Ma, J.; Huang, S.; Chen, L.; Chang, S.F. Few-shot object detection with fully cross-transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5321–5330.
- Liu, Y.; Zhang, G.; Li, X. Hint-DETR: A transfer learning model based on DETR for few-shot defect detection. IEEE Trans. Instrum. Meas. 2025, 74, 5033711.
| Configurations | Base Classes | Novel Classes |
|---|---|---|
| 1 | cube, ball, human body, distressed airplanes, square cage, metal bucket, tyre | cylinder, bluerov, circular cage |
| 2 | cube, cylinder, human body, circular cage, square cage, tyre, bluerov | ball, metal bucket, distressed airplanes |
| 3 | cylinder, distressed airplanes, circular cage, square cage, metal bucket, bluerov | cube, tyre, human body |
| Training Stages | Pre-Training Stage | Fine-Tuning Stage |
|---|---|---|
| Parameter Optimization Module | All | Projection, Classifier, Query Projection, DPGFRM |
| Optimizer | AdamW | AdamW |
| Epochs | 50 | 24 |
| Batch Size | 4 | 2 |
| Initial Learning Rate |  |  |
| Weight Decay |  |  |
| Hyperparameter n | 5 | 5 |
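The parameter-optimization row above implies that fine-tuning updates only the Projection, Classifier, Query Projection, and DPGFRM modules while the rest of the network stays frozen, with AdamW in both stages. Below is a minimal sketch of that setup; the attribute-name keys are hypothetical matches for the table rows, and the learning rate and weight decay are placeholders because their values are elided in the table.

```python
import torch

# Hypothetical parameter-name keys mirroring the table's fine-tuning column;
# the real attribute names in the authors' code base are not given here.
TRAINABLE_KEYS = ("projection", "classifier", "query_projection", "dpgfrm")

def build_finetune_optimizer(model: torch.nn.Module,
                             lr: float = 1e-4,           # placeholder: value elided in the table
                             weight_decay: float = 1e-4  # placeholder: value elided in the table
                             ) -> torch.optim.AdamW:
    """Freeze everything except the modules listed for the fine-tuning stage."""
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name.lower() for k in TRAINABLE_KEYS)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=weight_decay)
```

Freezing by name rather than by module reference keeps the sketch independent of the model's exact class hierarchy; the same selection matches the best-performing row of the freezing ablation in Section 4.4.2.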
| Methods | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline [27] | 27.6 ± 0.5 | 28.4 ± 0.2 | 29.6 ± 0.4 | 31.4 ± 0.2 |
| B-FSDet [15] | 25.3 ± 0.4 | 27.6 ± 0.4 | 28.4 ± 0.2 | 30.2 ± 0.3 |
| DeFRCN [14] | 28.8 ± 0.3 | 30.6 ± 0.3 | 32.4 ± 0.2 | 34.7 ± 0.4 |
| Meta Faster R-CNN [12] | 30.5 ± 0.2 | 33.2 ± 0.6 | 36.9 ± 0.3 | 39.8 ± 0.1 |
| FSCE [13] | 27.5 ± 0.5 | 29.4 ± 0.3 | 32.5 ± 0.4 | 33.7 ± 0.4 |
| TIP [45] | 25.8 ± 0.2 | 27.2 ± 0.5 | 28.3 ± 0.3 | 29.7 ± 0.4 |
| AirDet [49] | 31.5 ± 0.2 | 34.3 ± 0.4 | 36.8 ± 0.2 | 39.0 ± 0.3 |
| FCT [54] | 28.4 ± 0.4 | 31.6 ± 0.5 | 33.0 ± 0.2 | 35.9 ± 0.5 |
| Meta-DETR [16] | 32.5 ± 0.5 | 35.8 ± 0.1 | 37.5 ± 0.3 | 40.7 ± 0.2 |
| Hint-DETR [55] | 34.5 ± 0.2 | 37.4 ± 0.4 | 39.7 ± 0.3 | 41.3 ± 0.3 |
| Ours | 35.2 ± 0.4 | 38.7 ± 0.4 | 40.3 ± 0.2 | 43.3 ± 0.2 |
| Methods | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline [27] | 28.1 ± 0.8 | 29.0 ± 0.4 | 30.4 ± 0.3 | 31.6 ± 0.3 |
| B-FSDet [15] | 26.6 ± 0.6 | 28.5 ± 0.2 | 29.3 ± 0.5 | 31.4 ± 0.2 |
| DeFRCN [14] | 28.4 ± 0.3 | 29.7 ± 0.5 | 32.4 ± 0.4 | 35.9 ± 0.2 |
| Meta Faster R-CNN [12] | 33.5 ± 0.4 | 35.9 ± 0.5 | 38.1 ± 0.2 | 42.1 ± 0.1 |
| FSCE [13] | 28.3 ± 0.5 | 29.6 ± 0.4 | 31.4 ± 0.3 | 34.5 ± 0.4 |
| TIP [45] | 26.4 ± 0.2 | 28.2 ± 0.3 | 29.5 ± 0.6 | 32.3 ± 0.3 |
| AirDet [49] | 34.4 ± 0.3 | 36.7 ± 0.2 | 38.5 ± 0.2 | 42.3 ± 0.3 |
| FCT [54] | 30.3 ± 0.2 | 33.4 ± 0.5 | 35.6 ± 0.4 | 39.9 ± 0.5 |
| Meta-DETR [16] | 36.4 ± 0.2 | 38.6 ± 0.1 | 41.5 ± 0.7 | 44.1 ± 0.2 |
| Hint-DETR [55] | 37.5 ± 0.5 | 40.3 ± 0.4 | 42.3 ± 0.2 | 44.5 ± 0.3 |
| Ours | 38.9 ± 0.3 | 41.2 ± 0.5 | 42.2 ± 0.5 | 45.0 ± 0.2 |
| Methods | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline [27] | 25.3 ± 0.5 | 26.4 ± 0.3 | 28.5 ± 0.2 | 30.1 ± 0.3 |
| B-FSDet [15] | 23.7 ± 0.2 | 25.4 ± 0.1 | 27.5 ± 0.5 | 29.7 ± 0.5 |
| DeFRCN [14] | 26.7 ± 0.5 | 27.5 ± 0.2 | 30.4 ± 0.4 | 33.2 ± 0.2 |
| Meta Faster R-CNN [12] | 30.2 ± 0.6 | 31.4 ± 0.1 | 34.3 ± 0.2 | 39.4 ± 0.3 |
| FSCE [13] | 26.5 ± 0.4 | 29.2 ± 0.5 | 30.3 ± 0.3 | 32.6 ± 0.6 |
| TIP [45] | 24.5 ± 0.3 | 26.9 ± 0.7 | 28.5 ± 0.1 | 31.4 ± 0.4 |
| AirDet [49] | 32.3 ± 0.2 | 33.4 ± 0.5 | 35.3 ± 0.4 | 39.5 ± 0.2 |
| FCT [54] | 27.3 ± 0.4 | 28.8 ± 0.7 | 31.4 ± 0.2 | 34.7 ± 0.5 |
| Meta-DETR [16] | 32.4 ± 0.4 | 33.3 ± 0.3 | 35.4 ± 0.5 | 41.6 ± 0.2 |
| Hint-DETR [55] | 32.3 ± 0.2 | 34.4 ± 0.6 | 38.2 ± 0.4 | 42.5 ± 0.1 |
| Ours | 33.2 ± 0.2 | 36.4 ± 0.1 | 39.6 ± 0.2 | 42.9 ± 0.2 |
| Methods | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline [27] | 9.6 ± 0.4 | 10.2 ± 0.3 | 11.8 ± 0.3 | 13.7 ± 0.5 |
| B-FSDet [15] | 9.5 ± 0.3 | 10.4 ± 0.5 | 12.1 ± 0.3 | 14.7 ± 0.6 |
| DeFRCN [14] | 10.6 ± 0.6 | 12.3 ± 0.4 | 14.5 ± 0.5 | 16.8 ± 0.2 |
| Meta Faster R-CNN [12] | 13.5 ± 0.3 | 14.1 ± 0.5 | 16.8 ± 0.7 | 18.9 ± 0.2 |
| FSCE [13] | 11.4 ± 0.4 | 13.0 ± 0.2 | 14.2 ± 0.4 | 15.1 ± 0.3 |
| TIP [45] | 10.9 ± 0.4 | 11.5 ± 0.5 | 13.7 ± 0.4 | 14.7 ± 0.2 |
| AirDet [49] | 13.7 ± 0.5 | 14.5 ± 0.3 | 16.0 ± 0.5 | 18.2 ± 0.1 |
| FCT [54] | 11.6 ± 0.2 | 12.3 ± 0.5 | 14.6 ± 0.4 | 16.5 ± 0.4 |
| Meta-DETR [16] | 14.2 ± 0.4 | 15.8 ± 0.5 | 17.6 ± 0.3 | 19.3 ± 0.1 |
| Hint-DETR [55] | 15.5 ± 0.5 | 16.7 ± 0.2 | 18.3 ± 0.2 | 19.6 ± 0.4 |
| Ours | 15.4 ± 0.2 | 17.6 ± 0.3 | 19.3 ± 0.2 | 21.0 ± 0.5 |
| Methods | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline [27] | 10.4 ± 0.3 | 11.7 ± 0.5 | 12.9 ± 0.6 | 15.0 ± 0.2 |
| B-FSDet [15] | 10.2 ± 0.2 | 11.3 ± 0.3 | 13.6 ± 0.2 | 16.2 ± 0.5 |
| DeFRCN [14] | 11.4 ± 0.5 | 13.2 ± 0.2 | 15.9 ± 0.5 | 18.6 ± 0.4 |
| Meta Faster R-CNN [12] | 15.0 ± 0.5 | 16.2 ± 0.1 | 18.3 ± 0.2 | 20.1 ± 0.3 |
| FSCE [13] | 12.5 ± 0.4 | 14.2 ± 0.5 | 15.7 ± 0.2 | 16.9 ± 0.3 |
| TIP [45] | 11.3 ± 0.2 | 12.5 ± 0.4 | 14.7 ± 0.6 | 16.3 ± 0.5 |
| AirDet [49] | 14.6 ± 0.5 | 16.0 ± 0.4 | 17.2 ± 0.4 | 19.6 ± 0.1 |
| FCT [54] | 12.0 ± 0.3 | 13.2 ± 0.3 | 15.0 ± 0.6 | 17.3 ± 0.5 |
| Meta-DETR [16] | 15.3 ± 0.4 | 16.5 ± 0.3 | 18.8 ± 0.5 | 20.4 ± 0.2 |
| Hint-DETR [55] | 15.5 ± 0.5 | 17.0 ± 0.3 | 19.2 ± 0.6 | 21.0 ± 0.2 |
| Ours | 16.9 ± 0.1 | 18.3 ± 0.4 | 19.9 ± 0.2 | 21.8 ± 0.5 |
| Methods | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline [27] | 9.1 ± 0.3 | 9.9 ± 0.4 | 10.7 ± 0.4 | 12.7 ± 0.3 |
| B-FSDet [15] | 8.7 ± 0.3 | 9.9 ± 0.5 | 11.5 ± 0.6 | 13.8 ± 0.2 |
| DeFRCN [14] | 9.4 ± 0.5 | 10.7 ± 0.2 | 11.9 ± 0.5 | 14.5 ± 0.1 |
| Meta Faster R-CNN [12] | 12.8 ± 0.1 | 13.8 ± 0.4 | 15.2 ± 0.5 | 17.9 ± 0.3 |
| FSCE [13] | 10.5 ± 0.7 | 11.6 ± 0.4 | 13.0 ± 0.4 | 14.4 ± 0.5 |
| TIP [45] | 9.6 ± 0.2 | 10.5 ± 0.2 | 12.3 ± 0.4 | 13.9 ± 0.3 |
| AirDet [49] | 12.5 ± 0.3 | 13.9 ± 0.3 | 15.3 ± 0.4 | 18.1 ± 0.2 |
| FCT [54] | 10.1 ± 0.5 | 11.3 ± 0.2 | 13.2 ± 0.2 | 15.8 ± 0.3 |
| Meta-DETR [16] | 13.6 ± 0.2 | 14.5 ± 0.6 | 16.3 ± 0.3 | 18.9 ± 0.6 |
| Hint-DETR [55] | 14.2 ± 0.6 | 15.6 ± 0.1 | 17.0 ± 0.4 | 18.5 ± 0.3 |
| Ours | 14.1 ± 0.5 | 15.4 ± 0.6 | 17.2 ± 0.1 | 19.3 ± 0.3 |
| Strategies | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| Baseline | 27.8 | 28.5 | 29.9 | 31.3 |
| Baseline + Memory Feature Enhancement Compensation Mechanism | 31.7 | 33.4 | 36.0 | 39.5 |
| Baseline + Visual Prompt Enhancement Mechanism | 30.9 | 32.8 | 36.2 | 39.1 |
| Baseline + Multi-Stage Training Strategy | 29.2 | 30.6 | 33.7 | 36.9 |
| Baseline + All | 35.3 | 38.6 | 40.5 | 43.2 |
| Frozen Modules | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| No Freezing | 32.6 | 34.3 | 37.8 | 41.5 |
| Backbone | 33.1 | 35.7 | 38.8 | 42.1 |
| Backbone + Encoder + PAN-Fusion + Decoder | 34.9 | 37.6 | 39.2 | 42.7 |
| Projection + Query Projection + Classifier + DPGFRM | 25.6 | 26.2 | 28.5 | 30.1 |
| Ours | 35.3 | 38.6 | 40.5 | 43.2 |
| Number of Prompts (n) | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| 3 | 34.7 ± 0.4 | 36.2 ± 0.2 | 39.1 ± 0.3 | 41.6 ± 0.5 |
| 5 | 35.2 ± 0.4 | 38.7 ± 0.4 | 40.3 ± 0.2 | 43.3 ± 0.2 |
| 10 | 36.0 ± 0.5 | 39.2 ± 0.5 | 40.5 ± 0.3 | 43.6 ± 0.4 |
| 15 | 36.4 ± 0.2 | 39.0 ± 0.6 | 40.7 ± 0.3 | 43.5 ± 0.2 |
| Number of Prompts (n) | 1-Shot | 3-Shot | 5-Shot | 10-Shot |
|---|---|---|---|---|
| 3 | 33.8 ± 0.3 | 35.2 ± 0.4 | 38.9 ± 0.2 | 40.1 ± 0.5 |
| 5 | 34.3 ± 0.5 | 37.0 ± 0.6 | 39.5 ± 0.2 | 42.4 ± 0.4 |
| 10 | 35.1 ± 0.3 | 38.0 ± 0.4 | 39.6 ± 0.6 | 42.4 ± 0.2 |
| 15 | 35.3 ± 0.3 | 37.2 ± 0.5 | 39.5 ± 0.3 | 42.2 ± 0.4 |
| Methods | GFLOPs | Params (M) | FPS |
|---|---|---|---|
| RT-DETR | 64 | 32 | 63 |
| Hint-DETR | 89 | 88 | 24 |
| Meta-DETR | 763 | 52 | 16 |
| FS2-DETR (Ours) | 102 | 50 | 43 |