LDFSAM: Localization Distillation-Enhanced Feature Prompting SAM for Medical Image Segmentation
Abstract
1. Introduction
- We propose LDFSAM, a framework that replaces low-dimensional geometric prompts with localization-aware feature prompts injected into the latent space of a SAM-based decoder. The combination of Dense Feature Prompts (DFPs) and Sparse Feature Prompts (SFPs) encodes both where the object is and what it looks like, enabling more reliable segmentation of small, crowded, and highly deformable structures.
- A lightweight detector trained via localization distillation provides distilled multi-scale features to a dual-stream prompt encoder, improving prompt quality without increasing inference cost.
- We conduct extensive experiments across four public benchmarks and an additional private CBCT cohort. Ablation studies show that latent feature-level prompts substantially outperform geometric box prompts, and that combining DFPs and SFPs yields consistent gains. Across different annotation budgets, LDFSAM surpasses existing SAM-based baselines and achieves Dice scores competitive with or superior to strong task-specific architectures, while keeping trainable overhead modest and generalizing well across datasets.
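To fix ideas before the method sections below, here is a minimal NumPy sketch of the dense/sparse feature-prompt notion from the first contribution: a pixel-wise Dense Feature Prompt as a projection of the detector's fused feature map into the decoder's embedding width, and a few Sparse Feature Prompts as object-level tokens pooled from the most salient locations. All names, shapes, and the top-k pooling rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def make_prompts(detector_feat, num_sparse=4, embed_dim=256, rng=None):
    """Toy dense/sparse feature-prompt construction (illustrative only).

    detector_feat : (H, W, C) fused multi-scale detector feature map.
    Returns a dense prompt (H, W, embed_dim), intended to be combined
    with the image embedding, and `num_sparse` token vectors
    (num_sparse, embed_dim) appended to the mask decoder's prompt tokens.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    H, W, C = detector_feat.shape
    # Dense Feature Prompt: a 1x1-style projection of detector features
    # to the decoder's embedding width, kept at full spatial resolution.
    W_dense = rng.standard_normal((C, embed_dim)) / np.sqrt(C)
    dense_prompt = detector_feat @ W_dense               # (H, W, embed_dim)
    # Sparse Feature Prompts: pool the projected map at the top-k most
    # activated locations into a small set of object-level tokens.
    saliency = np.abs(detector_feat).sum(axis=-1).ravel()
    top = np.argsort(saliency)[-num_sparse:]
    flat = dense_prompt.reshape(-1, embed_dim)
    sparse_prompts = flat[top]                           # (num_sparse, embed_dim)
    return dense_prompt, sparse_prompts

feat = np.random.default_rng(1).standard_normal((16, 16, 64))
dense, sparse = make_prompts(feat)
print(dense.shape, sparse.shape)   # (16, 16, 256) (4, 256)
```

The point of the sketch is the split of roles: the dense prompt preserves "what it looks like" at every pixel, while the sparse tokens summarize "where the object is" for the decoder's attention layers.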
2. Related Work
2.1. Task-Specific Segmentation Networks
2.2. Segment Anything and SAM-Based Medical Segmentation
2.3. From Detector-SAM Hybrids to LDFSAM
3. Method
3.1. Overview of LDFSAM
3.2. YOLO-Based Dense–Sparse Feature Prompt Encoder
3.2.1. Multi-Scale Feature Fusion
3.2.2. Dense and Sparse Feature Prompts
3.3. Localization Distillation on YOLO Features
3.4. Training Strategy
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Ablation on Dense–Sparse Feature Prompts
4.4. Effect of Localization Distillation on the Prompt Generator
4.5. Comparison with Other SAM-Based Methods
4.6. Comparison with Conventional Segmentation Networks
4.7. Cross-Center Generalization on Private CBCT Data
5. Discussion
5.1. From Geometric Prompts to Feature-Level Prompts
5.2. Localization Distillation
5.3. Data Efficiency and Cross-Domain Generalization
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Van Der Laak, J.A.; Van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88.
- Kebaili, A.; Lapuyade-Lahorgue, J.; Ruan, S. Deep learning approaches for data augmentation in medical imaging: A review. J. Imaging 2023, 9, 81.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
- Isensee, F.; Jaeger, P.F.; Kohl, S.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211.
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2023; pp. 4015–4026.
- Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2024, 15, 654.
- Cheng, J.; Ye, J.; Deng, Z.; Chen, J.; Li, T.; Wang, H.; Su, Y.; Huang, Z.; Chen, J.; Jiang, L.; et al. SAM-Med2D. arXiv 2023, arXiv:2308.16184.
- Rahman, M.M.; Munir, M.; Jha, D.; Bagci, U.; Marculescu, R. PP-SAM: Perturbed prompts for robust adaption of segment anything model for polyp segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 4989–4995.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2015; pp. 3431–3440.
- Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2016; pp. 424–432.
- Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV); IEEE: New York, NY, USA, 2016; pp. 565–571.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 2961–2969.
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2019; pp. 9157–9166.
- Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic and fast instance segmentation. Adv. Neural Inf. Process. Syst. 2020, 33, 17721–17732.
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
- Wang, C.; Wang, L.; Wang, N.; Wei, X.; Feng, T.; Wu, M.; Yao, Q.; Zhang, R. CFATransUnet: Channel-wise cross fusion attention and transformer for 2D medical image segmentation. Comput. Biol. Med. 2024, 168, 107803.
- Nikulins, A.; Edelmers, E.; Sudars, K.; Polaka, I. Adapting Classification Neural Network Architectures for Medical Image Segmentation Using Explainable AI. J. Imaging 2025, 11, 55.
- Huang, Y.; Yang, X.; Liu, L.; Zhou, H.; Chang, A.; Zhou, X.; Chen, R.; Yu, J.; Chen, J.; Chen, C. Segment anything model for medical images? Med. Image Anal. 2024, 92, 103061.
- Ali, M.; Wu, T.; Hu, H.; Luo, Q.; Xu, D.; Zheng, W.; Jin, N.; Yang, C.; Yao, J. A review of the segment anything model (SAM) for medical image analysis: Accomplishments and perspectives. Comput. Med. Imaging Graph. 2024, 119, 102473.
- Zhang, K.; Liu, D. Customized segment anything model for medical image segmentation. arXiv 2023, arXiv:2304.13785.
- Xu, Q.; Li, J.; He, X.; Liu, Z.; Chen, Z.; Duan, W.; Li, C.; He, M.M.; Tesema, F.B.; Cheah, W.P. ESP-MedSAM: Efficient self-prompting SAM for universal domain-generalized medical image segmentation. arXiv 2024, arXiv:2407.14153.
- Yang, L.; Liu, P.; Zhang, G.; Zhao, H.; Zhao, C. Domain-Adaptive Segment Anything Model for Cross-Domain Water Body Segmentation in Satellite Imagery. J. Imaging 2025, 11, 437.
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8; Ultralytics: Frederick, MD, USA, 2023.
- Gül, S.; Cetinel, G.; Aydin, B.M.; Akgün, D.; Öztaş Kara, R. YOLOSAMIC: A Hybrid Approach to Skin Cancer Segmentation with the Segment Anything Model and YOLOv8. Diagnostics 2025, 15, 479.
- Pandey, S.; Chen, K.-F.; Dam, E.B. Comprehensive multimodal segmentation in medical imaging: Combining YOLOv8 with SAM and HQ-SAM models. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2023; pp. 2592–2598.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
- Zheng, Z.; Ye, R.; Wang, P.; Ren, D.; Zuo, W.; Hou, Q.; Cheng, M.-M. Localization distillation for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 9407–9416.
- Cui, Z.; Fang, Y.; Mei, L.; Zhang, B.; Yu, B.; Liu, J.; Jiang, C.; Sun, Y.; Ma, L.; Huang, J. A fully automatic AI system for tooth and alveolar bone segmentation from cone-beam CT images. Nat. Commun. 2022, 13, 2096.
- Codella, N.; Rotemberg, V.; Tschandl, P.; Celebi, M.E.; Dusza, S.; Gutman, D.; Helba, B.; Kalloo, A.; Liopyris, K.; Marchetti, M. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (ISIC). arXiv 2019, arXiv:1902.03368.
- Zhao, Q.; Lyu, S.; Bai, W.; Cai, L.; Liu, B.; Cheng, G.; Wu, M.; Sang, X.; Yang, M.; Chen, L. MMOTU: A multi-modality ovarian tumor ultrasound image dataset for unsupervised cross-domain semantic segmentation. arXiv 2022, arXiv:2207.06799.
- Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Halvorsen, P.; De Lange, T.; Johansen, D.; Johansen, H.D. Kvasir-SEG: A segmented polyp dataset. In Proceedings of the International Conference on Multimedia Modeling; Springer: Berlin/Heidelberg, Germany, 2019; pp. 451–462.
- Gibson, E.; Giganti, F.; Hu, Y.; Bonmati, E.; Bandula, S.; Gurusamy, K.; Davidson, B.; Pereira, S.P.; Clarkson, M.J.; Barratt, D.C. Automatic multi-organ segmentation on abdominal CT with dense V-networks. IEEE Trans. Med. Imaging 2018, 37, 1822–1834.
- Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B. Attention U-Net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999.
- Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. UNETR: Transformers for 3D medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: New York, NY, USA, 2022; pp. 574–584.
- Hatamizadeh, A.; Nath, V.; Tang, Y.; Yang, D.; Roth, H.R.; Xu, D. Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. In Proceedings of the International MICCAI Brainlesion Workshop; Springer: Berlin/Heidelberg, Germany, 2021; pp. 272–284.
- Zhou, H.-Y.; Guo, J.; Zhang, Y.; Han, X.; Yu, L.; Wang, L.; Yu, Y. nnFormer: Volumetric medical image segmentation via a 3D transformer. IEEE Trans. Image Process. 2023, 32, 4036–4045.
- Lee, H.H.; Bao, S.; Huo, Y.; Landman, B.A. 3D UX-Net: A large kernel volumetric convnet modernizing hierarchical transformer for medical image segmentation. arXiv 2022, arXiv:2209.15076.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
- Gu, R.; Wang, G.; Song, T.; Huang, R.; Aertsen, M.; Deprest, J.; Ourselin, S.; Vercauteren, T.; Zhang, S. CA-Net: Comprehensive attention convolutional neural networks for explainable medical image segmentation. IEEE Trans. Med. Imaging 2020, 40, 699–711.
- Gu, Z.; Cheng, J.; Fu, H.; Zhou, K.; Hao, H.; Zhao, Y.; Zhang, T.; Gao, S.; Liu, J. CE-Net: Context encoder network for 2D medical image segmentation. IEEE Trans. Med. Imaging 2019, 38, 2281–2292.
- Feng, S.; Zhao, H.; Shi, F.; Cheng, X.; Wang, M.; Ma, Y.; Xiang, D.; Zhu, W.; Chen, X. CPFNet: Context pyramid fusion network for medical image segmentation. IEEE Trans. Med. Imaging 2020, 39, 3008–3018.
- Jin, Q.; Cui, H.; Sun, C.; Meng, Z.; Su, R. Cascade knowledge diffusion network for skin lesion diagnosis and segmentation. Appl. Soft Comput. 2021, 99, 106881.

| Backbone | Method | 3D CBCT Tooth IoU (%) ↑ | 3D CBCT Tooth Dice (%) ↑ | ISIC 2018 IoU (%) ↑ | ISIC 2018 Dice (%) ↑ | MMOTU IoU (%) ↑ | MMOTU Dice (%) ↑ | Kvasir-SEG IoU (%) ↑ | Kvasir-SEG Dice (%) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv8n | Baseline | 78.91 | 87.36 | 83.55 | 90.28 | 79.34 | 87.88 | 81.67 | 89.54 |
| YOLOv8n | Feature-D | 81.05 | 88.92 | 85.66 | 91.83 | 82.29 | 89.72 | 85.01 | 91.06 |
| YOLOv8n | Feature-S | 79.36 | 88.14 | 84.59 | 91.09 | 81.05 | 88.95 | 84.04 | 90.82 |
| YOLOv8n | Feature-S+D | 81.89 | 89.65 | 86.65 | 92.46 | 83.13 | 90.30 | 85.16 | 91.57 |
| YOLOv8x | Baseline | 81.64 | 89.30 | 85.62 | 91.86 | 83.01 | 90.23 | 84.50 | 91.10 |
| YOLOv8x | Feature-D | 84.78 | 91.05 | 87.90 | 93.26 | 86.15 | 92.21 | 87.44 | 92.81 |
| YOLOv8x | Feature-S | 83.56 | 90.17 | 86.76 | 92.62 | 84.72 | 91.39 | 86.28 | 92.33 |
| YOLOv8x | Feature-S+D | 85.98 | 91.93 | 89.05 | 93.95 | 87.11 | 92.76 | 87.62 | 93.12 |
| Method | 3D CBCT Tooth IoU (%) ↑ | 3D CBCT Tooth Dice (%) ↑ | ISIC 2018 IoU (%) ↑ | ISIC 2018 Dice (%) ↑ | MMOTU IoU (%) ↑ | MMOTU Dice (%) ↑ | Kvasir-SEG IoU (%) ↑ | Kvasir-SEG Dice (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| w/o distillation | 82.03 | 89.65 | 86.94 | 92.46 | 83.16 | 90.30 | 85.53 | 91.57 |
| KD distillation | 85.25 | 91.13 | 88.65 | 93.47 | 86.04 | 91.91 | 87.42 | 92.77 |
| LD (Main) distillation | 85.69 | 91.64 | 88.98 | 93.59 | 86.10 | 92.05 | 87.82 | 92.91 |
| LD (Main+VLR) distillation | 85.81 | 91.79 | 89.27 | 93.71 | 86.56 | 92.21 | 88.10 | 93.04 |
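For reference, localization distillation [28] transfers a teacher detector's discretized bounding-box edge distributions (as in DFL-style detection heads) to the student via temperature-scaled KL divergence. The sketch below is a generic NumPy form of that loss, not necessarily the exact variant used in this paper; the bin count, temperature, and mean reduction are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ld_loss(student_logits, teacher_logits, T=10.0):
    """KL(teacher || student) over discretized box-edge bins.

    logits: (N, 4, n_bins) -- one categorical distribution per box edge
    (left, top, right, bottom). Scaled by T^2 as in standard knowledge
    distillation so gradients stay comparable across temperatures.
    """
    p_t = softmax(teacher_logits / T)
    log_p_s = np.log(softmax(student_logits / T) + 1e-12)
    log_p_t = np.log(p_t + 1e-12)
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1)   # (N, 4)
    return (T * T) * kl.mean()

rng = np.random.default_rng(0)
s = rng.standard_normal((8, 4, 16))   # student edge logits, 16 bins
t = rng.standard_normal((8, 4, 16))   # teacher edge logits
print(ld_loss(s, t) >= 0.0)           # prints: True (KL is non-negative)
```

Unlike plain feature-map KD, this loss only matches the localization head's outputs, which is consistent with the table above where LD variants outperform KD while distilling strictly less information.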
| Method | Total Params (M) | Trainable Params (M) | GPU-Hours @ 100% Masks (CBCT) | GPU-Hours @ 100% Masks (ISIC 2018) | GPU-Hours @ 100% Masks (MMOTU) | GPU-Hours @ 100% Masks (Kvasir-SEG) |
|---|---|---|---|---|---|---|
| SAM [6] | 93.7 | 4.1 | 421.5 | 38.6 | 22.3 | 19.7 |
| MedSAM [7] | 93.7 | 4.1 | 410.4 | 30.8 | 17.2 | 15.3 |
| SAMed [21] | 93.9 | 4.2 | 216.7 | 11.9 | 6.4 | 6.0 |
| SAM-Med2D [8] | 271.2 | 186.8 | 132.0 | 7.9 | 3.8 | 3.0 |
| LDFSAM | 274.5 | 190.1 | 134.1 | 8.1 | 3.9 | 3.2 |
| Method | IoU (%) ↑ | Dice (%) ↑ | HD (mm) ↓ | ASSD (mm) ↓ | SO (%) ↑ |
|---|---|---|---|---|---|
| UNet3D [11] | 68.00 | 79.52 | 113.78 | 25.50 | 67.09 |
| DenseVNet [33] | 84.57 | 91.15 | 8.21 | 1.14 | 94.88 |
| AttentionUNet3D [34] | 52.52 | 64.08 | 147.10 | 61.10 | 42.49 |
| UNETR [35] | 74.30 | 81.84 | 107.89 | 17.95 | 73.14 |
| SwinUNETR [36] | 83.10 | 89.74 | 82.71 | 7.50 | 86.80 |
| nnFormer [37] | 83.54 | 90.66 | 51.28 | 5.08 | 90.89 |
| 3D UX-Net [38] | 75.40 | 84.89 | 108.52 | 19.69 | 73.48 |
| nnU-Net [4] | 85.33 | 91.50 | 7.87 | 0.96 | 95.05 |
| SegFormer [39] | 85.06 | 91.37 | 9.54 | 1.22 | 93.47 |
| TransUNet [16] | 84.65 | 90.69 | 12.30 | 2.65 | 91.26 |
| LDFSAM | 85.81 | 91.79 | 5.05 | 0.63 | 95.82 |
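The IoU and Dice columns reported throughout these tables are the standard overlap metrics; for a single binary mask pair they are linked by Dice = 2·IoU / (1 + IoU), which is a useful sanity check when reading paired scores. A small self-contained sketch:

```python
import numpy as np

def iou_dice(pred, gt):
    """IoU and Dice for binary masks (bool or {0,1} arrays)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return iou, dice

a = np.zeros((8, 8), bool); a[2:6, 2:6] = True   # 16-pixel square
b = np.zeros((8, 8), bool); b[3:7, 3:7] = True   # shifted copy, overlap 3x3
iou, dice = iou_dice(a, b)
print(round(iou, 3), round(dice, 3))   # prints: 0.391 0.562
```

Here the intersection is 9 pixels and the union is 23, so IoU = 9/23 ≈ 0.391 and Dice = 18/32 = 0.5625 = 2·IoU/(1 + IoU), matching the identity above.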
| Method | ISIC 2018 IoU (%) ↑ | ISIC 2018 Dice (%) ↑ | ISIC 2018 ACC (%) ↑ | Kvasir-SEG IoU (%) ↑ | Kvasir-SEG Dice (%) ↑ | Kvasir-SEG ACC (%) ↑ |
|---|---|---|---|---|---|---|
| U-Net [3] | 76.77 | 86.55 | 95.00 | 73.04 | 84.56 | 95.50 |
| AttU-Net [34] | 78.19 | 87.54 | 95.33 | 75.67 | 86.20 | 95.90 |
| CA-Net [40] | 68.82 | 80.96 | 92.96 | 71.48 | 83.29 | 94.98 |
| CE-Net [41] | 78.05 | 87.47 | 95.40 | 71.98 | 83.72 | 94.91 |
| CPF-Net [42] | 78.47 | 87.70 | 95.52 | 71.11 | 83.54 | 94.85 |
| CKDNet [43] | 77.89 | 87.35 | 95.27 | 70.23 | 82.74 | 94.60 |
| nnU-Net [4] | 80.10 | 88.42 | 95.88 | 87.15 | 92.90 | 98.11 |
| SegFormer [39] | 87.33 | 92.60 | 97.51 | 86.87 | 92.11 | 97.07 |
| TransUNet [16] | 82.45 | 89.53 | 96.39 | 77.65 | 86.82 | 96.04 |
| LDFSAM | 88.47 | 93.71 | 97.86 | 87.22 | 93.04 | 98.39 |
| Method | IoU (%) ↑ | Dice (%) ↑ | HD (mm) ↓ | ASSD (mm) ↓ | ACC (%) ↑ |
|---|---|---|---|---|---|
| U-Net [3] | 80.06 | 88.38 | 18.59 | 3.57 | 96.01 |
| nnU-Net [4] | 84.66 | 91.02 | 13.78 | 1.87 | 96.55 |
| SegFormer [39] | 82.52 | 90.14 | 15.11 | 2.42 | 96.30 |
| TransUNet [16] | 81.35 | 89.30 | 15.69 | 2.78 | 96.13 |
| LDFSAM | 86.56 | 92.21 | 12.05 | 1.80 | 97.10 |
| Method | Public Dataset IoU (%) ↑ | Public Dataset Dice (%) ↑ | Private Dataset IoU (%) ↑ | Private Dataset Dice (%) ↑ |
|---|---|---|---|---|
| SAM-Med2D [8] | 82.36 | 90.10 | 79.71 (↓2.65) | 88.24 (↓1.86) |
| nnU-Net [4] | 85.28 | 91.45 | 81.87 (↓3.41) | 89.33 (↓2.12) |
| LDFSAM | 85.81 | 91.79 | 84.22 (↓1.59) | 90.87 (↓0.92) |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Zhao, X.; Wang, C.; Xu, H.; Zhou, H.; Yu, Z.; Chen, T.; Wei, X.; Zhang, R. LDFSAM: Localization Distillation-Enhanced Feature Prompting SAM for Medical Image Segmentation. J. Imaging 2026, 12, 74. https://doi.org/10.3390/jimaging12020074

