Article

VG-SAM: Visual In-Context Guided SAM for Universal Medical Image Segmentation

by Gang Dai, Qingfeng Wang, Yutao Qin, Gang Wei and Shuangping Huang *
School of Electronics and Information, South China University of Technology, Guangzhou 510641, China
* Author to whom correspondence should be addressed.
Fractal Fract. 2025, 9(11), 722; https://doi.org/10.3390/fractalfract9110722
Submission received: 12 September 2025 / Revised: 27 October 2025 / Accepted: 6 November 2025 / Published: 8 November 2025

Abstract

Medical image segmentation, driven by the intrinsic fractal characteristics of biological patterns, plays a crucial role in medical image analysis. Recently, universal image segmentation, which aims to build models that generalize robustly to unseen anatomical structures and imaging modalities, has emerged as a promising research direction. To achieve this, previous solutions typically follow the in-context learning (ICL) framework, leveraging segmentation priors from a few labeled in-context references to improve prediction performance on out-of-distribution samples. However, these ICL-based methods often overlook the quality of the in-context set and struggle with capturing intricate anatomical details, thus limiting their segmentation accuracy. To address these issues, we propose VG-SAM, which employs a multi-scale in-context retrieval phase and a visual in-context guided segmentation phase. Specifically, inspired by the hierarchical and self-similar properties in fractal structures, we introduce a multi-level feature similarity strategy to select in-context samples that closely match the query image, thereby ensuring the quality of the in-context samples. In the segmentation phase, we propose to generate multi-granularity visual prompts based on the high-quality priors from the selected in-context set. Following this, these visual prompts, along with the semantic guidance signal derived from the in-context set, are seamlessly integrated into an adaptive fusion module, which effectively guides the Segment Anything Model (SAM) with powerful segmentation capabilities to achieve accurate predictions on out-of-distribution query images. Extensive experiments across multiple datasets demonstrate the effectiveness and superiority of our VG-SAM over the state-of-the-art (SOTA) methods. Notably, under the challenging one-shot reference setting, our VG-SAM surpasses SOTA methods by an average of 6.61 % in DSC across all datasets.

1. Introduction

Over the past decades, medical image analysis has evolved from an assistive diagnostic tool into a critical component of modern medical practice. It plays a pivotal role in early disease screening, precise diagnosis, the formulation of personalized treatment plans, as well as monitoring treatment responses and evaluating prognoses. Notably, many biological structures in medical images exhibit fractal-like patterns, characterized by self-similarity and hierarchical complexity [1]. These patterns offer valuable insights for understanding and modeling the intricate structures in medical images. Among the various techniques in medical image analysis, image segmentation is a fundamental one. The goal of medical image segmentation is to partition medical images into clear semantic regions, such as accurately identifying and delineating organs or lesion areas like the liver, kidneys, or tumors from CT images [2]. Medical image segmentation serves as the critical first step towards achieving intelligent image analysis, providing a vital foundation for subsequent high-level semantic understanding and applications. Through accurate segmentation, clinicians can more precisely identify and locate pathological regions, thereby supporting clinical diagnosis and treatment decision-making.
Recently, universal medical image segmentation [3,4,5], aimed at handling diverse imaging modalities and anatomical structures (cf. Figure 1), has emerged as a significant research direction. However, these solutions face a critical challenge in clinical practice—domain shift. This issue arises from various factors, including differences in scanning equipment, imaging protocols (such as sequence types, slice thickness, and spacing), reconstruction algorithms, patient-specific characteristics (such as age, body type, and physiological state), and the inherent heterogeneity of diseases. Even when trained on diverse datasets, universal models often experience significant performance degradation on out-of-distribution samples, severely limiting their robustness and practical applicability. Consequently, enhancing the domain adaptability of universal segmentation models has become a pressing challenge in current medical image segmentation research.
Figure 1. Examples of diverse medical images from the Med2D-20M dataset [6].
Traditional approaches [7,8] to addressing domain shift typically rely on a few unlabeled or labeled datasets from the target domain to fine-tune the model. However, a satisfactory model should achieve superior performance on unseen domains without additional training or fine-tuning. Recently, the framework of In-Context Learning [9] has demonstrated remarkable potential in the field of natural language processing, offering novel insights into tackling domain shift challenges in medical image segmentation. Unlike previous fine-tuning-based domain adaptation methods, In-Context Learning aims to guide the model to rapidly understand the distribution characteristics of the target domain by providing a few samples (a.k.a. in-context examples) during inference. This paradigm endows the model with quick adaptability to unseen domains, enhancing its flexibility and practicality without requiring extra model parameter updates.
Building on the above insights, several studies [5,10,11] introduce In-Context Learning (ICL) to universal medical image segmentation. Specifically, UniverSeg [3] and Neuralizer [10] design specialized fusion modules and integrate them into UNet [12] to facilitate information interaction between query images and in-context images. This integration allows the model to leverage segmentation priors from in-context images to improve prediction accuracy for query images. Gao et al. [11] propose a dual-similarity query mechanism that combines image semantic similarity and mask appearance similarity to retrieve and utilize visual in-context examples. However, these ICL-based methods suffer from two major limitations: (1) Their in-context learning capability heavily depends on the quality of the selected examples; poor-quality samples can significantly misguide the model, resulting in degraded performance. (2) They struggle to effectively capture fine-grained anatomical structures, such as subtle lesions or complex boundaries, which limits their ability to deliver accurate segmentation results.
To address the above issues and achieve efficient universal medical image segmentation, we propose VG-SAM, a novel in-context guided universal medical image segmentation method. The idea of VG-SAM is to introduce the guiding capability of visual in-context learning into the powerful zero-shot segmentation capabilities of the Segment Anything Model (SAM) [13], thereby enhancing the segmentation performance in out-of-distribution medical images. Our VG-SAM consists of two key steps: multi-scale in-context retrieval and visual in-context guided segmentation. Specifically, inspired by the hierarchical organization and self-similarity inherent in fractal structures, during the retrieval phase, an efficient in-context selection strategy is performed to ensure the acquisition of high-quality in-context examples most relevant to the query image. Semantic guidance signals are then constructed from the prior knowledge of these high-quality samples. In the segmentation phase, to further enhance the utilization of in-context information, we propose a visual prompt construction module designed to generate multi-granularity visual prompts. Finally, an adaptive fusion module seamlessly integrates the semantic guidance signals and visual prompts to guide SAM in producing more precise segmentation masks, enabling robust and high-performance segmentation of complex medical images.
To sum up, our study has the following four contributions:
  • We propose a novel VG-SAM method for universal medical image segmentation, which achieves deep integration of in-context learning with Segment Anything Model (SAM) and effectively addresses the performance degradation in unseen domains.
  • We design an in-context retrieval strategy based on multi-level feature similarity to select the most relevant in-context examples for the query image and extract high-quality priors from them to construct semantic guidance signals.
  • We propose a visual in-context generation strategy that leverages a visual prompt construction module to generate multi-granularity visual prompts tailored for SAM. Furthermore, an adaptive fusion module seamlessly integrates semantic guidance signals and visual prompts to effectively guide SAM’s segmentation process, ensuring the generation of high-quality segmentation masks.
  • We evaluate our VG-SAM on multiple publicly available medical segmentation datasets encompassing diverse imaging modalities and anatomical structures. Extensive experiments demonstrate the effectiveness and superiority of our VG-SAM. Our source code is publicly available at: https://github.com/dailenson/VG-SAM (accessed on 13 October 2025).

2. Related Work

2.1. Task-Specific Medical Image Segmentation

The rapid advancement of deep learning technology has significantly driven progress in medical image segmentation. Numerous studies have successfully leveraged neural network-based models, including Convolutional Neural Networks (CNNs) [14] and Vision Transformers (ViTs) [15], to tackle medical image segmentation tasks. Specifically, UNet [12] incorporates skip connections, which directly convey high-resolution feature maps from the down-sampling stages to their corresponding up-sampling stages. This mechanism effectively mitigates the loss of spatial details during down-sampling. With its simple yet powerful design, U-Net achieves robust performance even when medical datasets are relatively small, solidifying its role as a foundational model for further research [16,17,18].
Previous U-Net-based methods typically rely on CNN architectures, which are inherently limited in capturing global long-range dependencies within images. This limitation poses a significant bottleneck, particularly in scenarios that require a comprehensive understanding of the overall structure of organs or the identification of large-scale pathological patterns. To address this challenge, researchers have turned their attention to transformer architectures [15], which excel at global attention modeling. Popular solutions include hybrid architectures like TransUNet [19], as well as purely transformer-based designs such as SwinUNet [20] and UNETR [21]. These transformer-based approaches have shown remarkable advantages, particularly in tasks that require modeling long-range dependencies.
However, the above segmentation methods are tailored specifically for certain imaging modalities, individual organs, or lesions, such as CT lung nodule segmentation, MRI brain tumor segmentation, and ultrasound breast tumor segmentation. While these task-specific models [22,23,24] excel in their designated tasks, their generalization capabilities are significantly limited. When faced with unseen imaging modalities, organs, or lesions, they require redesigning the model architecture and performing extensive parameter updates. This adaptation process is both time-consuming and costly, making it challenging to apply these models across various clinical scenarios.

2.2. Universal Medical Image Segmentation

Universal medical image segmentation seeks to address the challenges posed by data heterogeneity across different imaging modalities and anatomical structures. In this section, we review prior studies on universal medical image segmentation, categorizing existing methods into two main categories: SAM-based and ICL-based.
SAM-based Methods.
The Segment Anything Model (SAM) [13] leverages an interactive prompt-based mechanism and large-scale image pretraining techniques to achieve powerful natural image segmentation capabilities. SAM’s success has inspired researchers to explore extending it to the medical image field. However, when directly applied to medical images, SAM exhibits a noticeable performance drop [25,26]. This decline is primarily due to: (1) significant appearance differences between natural and medical images, especially in color, brightness, and contrast; and (2) accurately segmenting boundaries of target objects (e.g., tissues and organs) often relies on domain-specific expert knowledge, which is generally lacking in annotation processes for natural images.
Some efforts [27,28] have been made to leverage fine-tuning techniques to efficiently adapt SAM to medical images. Specifically, MedSAM [29] is the first to introduce a large-scale medical image dataset, enabling fine-tuning of SAM's prompt encoder and mask decoder to enhance segmentation performance on medical datasets. Alternatively, SAM-Med2D [30] inserts Adapter [31] layers after each transformer layer of the SAM image encoder, effectively transferring the knowledge SAM learned from natural images to medical image segmentation tasks. Similarly, SAMed [32] and Medical SAM-Adapter [33] freeze SAM's image encoder and adopt Low-Rank Adaptation (LoRA) [32] to update a few parameters. While these SAM-based approaches have achieved notable progress in medical image segmentation, they still rely on additional architectural modifications and training procedures, leading to relatively high adaptation costs.
ICL-based Methods. In-context learning, as introduced by GPT-3 [9], enables models to efficiently handle unseen domains (i.e., out-of-distribution samples) guided by in-context examples during the inference phase. Unlike fine-tuning strategies, in-context learning enables models to generalize to entirely new data without requiring any re-training, demonstrating remarkable adaptability and flexibility. In the vision community, Flamingo [34] leverages language prompts to describe image and video information, guiding vision-language models to exhibit strong domain generalization capabilities. Recently, purely visual in-context learning methods [10,35], which no longer rely on language instructions, have been proposed. These approaches directly utilize the fusion of in-context images and query images to enhance the model's ability to handle unseen domains. For instance, SegGPT [35] reformulates image segmentation as an in-context coloring task with random color mapping for each data sample, guiding the model to complete the coloring task using in-context information. During inference, SegGPT efficiently performs various image and video segmentation tasks using only image-mask pairs as prompts.
For medical image segmentation, UniverSeg [3] introduces an innovative module called CrossBlock. This module, composed of cross-convolution layers, is integrated into the U-Net architecture. The CrossBlock module facilitates information exchange between query images and in-context images, thereby enabling the model to acquire in-context learning capabilities. Gao et al. [11] optimize the selection and utilization of in-context samples, improving the overall effectiveness of in-context learning. More recently, ICL-SAM [36] has attempted a straightforward integration of in-context learning with SAM. However, it faces significant challenges in generalizing to medical images under zero-shot settings. The primary limitation lies in its reliance on the confidence map generated by the original SAM architecture, which performs poorly when applied to unseen domains. In contrast, our VG-SAM employs a novel visual in-context generation strategy and a multi-scale retrieval mechanism, achieving a seamless and effective integration with SAM and delivering superior performance in universal medical image segmentation tasks.

3. Proposed Method

Problem Statement. Our goal is to accurately predict the segmentation mask for a query image, conditioned on a set of in-context examples. For a given query image $I_q$, along with an in-context reference set $D_s = \{(x_i, y_i)\}_{i=1}^{N}$ comprising $N$ image-label pairs, the segmentation task can be defined as follows:
$$\hat{y}_q = F_{\theta}(I_q, D_s),$$
where $\hat{y}_q$ denotes the predicted segmentation mask for $I_q$. In this framework, $D_s$ serves as the reference condition, guiding $F_{\theta}$ to generate segmentation masks across diverse query images without the need for model re-training.

3.1. Overall Scheme

To achieve robust segmentation performance, we propose VG-SAM, a novel visual in-context guided SAM-based method for universal medical segmentation. As shown in Figure 2, the architecture of VG-SAM consists of three main components: a multi-scale in-context retrieval strategy, a visual in-context generation strategy, and a SAM-based adaptive fusion module. In the retrieval phase, a shared pre-trained encoder is used to extract multi-level features from both the query image and the in-context images. By calculating feature similarity, we select the most relevant in-context candidates for the query image, which are subsequently employed to construct the semantic guidance signal. During the segmentation phase, the selected in-context candidates and the query image are fed into the visual in-context generation branch: a coarse segmentation module first produces preliminary segmentation predictions (a.k.a. intermediate masks), and a visual-prompt construction module then generates multi-granularity visual prompts. Guided by the semantic guidance signal, SAM receives these visual prompts and performs refined segmentation predictions. Finally, the intermediate and refined masks are adaptively fused to deliver robust segmentation results with precise boundaries.

3.2. Multi-Scale In-Context Retrieval

Multi-level Feature Similarity Selection. Not all in-context samples contribute positively to the segmentation of the query image; in fact, some irrelevant in-context samples may even degrade performance. Thus, extracting prior knowledge that is closely related to the query sample from the provided in-context samples is crucial for subsequent segmentation. To achieve this, we propose a retrieval strategy based on multi-level feature similarity. This involves extracting high-dimensional semantic features of images to assess the similarity between each context sample and the query image. Specifically, to fully leverage the semantic representations across different levels, we feed the query image and all in-context images into a shared pre-trained SAM encoder E, enabling the extraction of multi-level features from various layers at different depths. This hierarchical feature extraction ensures that both low-level details and high-level semantic information are effectively captured for similarity calculation.
For an in-context image $s_i \in D_s$ and the query image $I_q$, the extracted features at the $l$-th layer of $E$ are represented as $z_s^{(l)} = E^{(l)}(s_i)$ and $z_q^{(l)} = E^{(l)}(I_q)$, respectively. We further compare the feature similarity between each in-context image and the query image. After obtaining similarity scores, we select the top $K$ in-context images with the highest scores to construct the in-context candidate set $\mathcal{S}$. This retrieval process can be expressed as:
$$\mathrm{Score}_i = \sum_{l \in \{4, 8, 12\}} w_l \, \mathrm{Sim}\big(z_s^{(l)}, z_q^{(l)}\big), \quad \mathcal{S} = \mathrm{TopK}(\mathrm{Score}_i, K),$$
where $\mathrm{Score}_i$ represents the similarity score between the $i$-th in-context image and the query image, $w_l$ denotes the layer-wise weight factor, $\mathrm{Sim}(\cdot)$ refers to the cosine similarity, $\mathrm{TopK}(\cdot)$ selects the top $K$ images with the highest scores, and $K$ indicates the number of in-context candidates. The selected $K$ in-context images are subsequently used for constructing semantic guidance and generating multi-granularity visual prompts.
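To make this retrieval step concrete, the following is a minimal PyTorch sketch of the weighted multi-level similarity and Top-K selection, assuming that the features from layers 4, 8, and 12 have already been pooled into fixed-length vectors; the function and argument names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn.functional as F

def retrieve_in_context(query_feats, pool_feats, weights=None, k=16):
    """Select the top-K in-context samples by weighted multi-level cosine similarity.

    query_feats: dict mapping layer index -> tensor of shape (D,)
    pool_feats:  dict mapping layer index -> tensor of shape (N, D),
                 one row per candidate in the in-context pool
    """
    if weights is None:
        weights = {4: 0.2, 8: 0.4, 12: 0.4}      # layer weights reported in the paper
    num_candidates = next(iter(pool_feats.values())).shape[0]
    scores = torch.zeros(num_candidates)
    for layer, w in weights.items():
        q = query_feats[layer].unsqueeze(0)       # (1, D)
        p = pool_feats[layer]                     # (N, D)
        scores += w * F.cosine_similarity(p, q, dim=1)
    top_scores, top_idx = torch.topk(scores, k=min(k, num_candidates))
    return top_idx, top_scores
```

The returned indices identify the $K$ candidates that form $\mathcal{S}$ for the subsequent guidance and prompt construction.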
Semantic Guidance Construction. We leverage the in-context candidates obtained during the retrieval phase to construct semantic guidance, thereby transforming prior information into supervisory signals for the subsequent segmentation process. As shown in Figure 3, our goal is to utilize the annotations from the in-context candidates to assist in generating a coarse segmentation mask for the query image. This mask serves as a semantic signal that guides the segmentation model to focus on the foreground regions (e.g., organs, lesion areas, and tumors) within the query image. Specifically, we first employ the unsupervised medical image registration algorithm HyperMorph [37] to compute the deformation field $\Phi$ between the in-context image $s_{best}$ with the highest similarity and the query image:
$$\Phi = f_{hyp}(s_{best}, I_q),$$
where $f_{hyp}$ denotes the HyperMorph algorithm. The obtained deformation field $\Phi$ is further applied to the ground-truth mask $y_s$ of $s_{best}$ through a Spatial Transformer Network (STN) [38], generating the semantic guidance signal $M_g$ for the query image $I_q$:
$$M_g = f_{STN}(y_s, \Phi).$$
Here, $M_g$ serves as a guiding condition that effectively enhances SAM's ability to perceive the target object, and $f_{STN}$ is a non-trainable network module designed for spatial transformations.
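As an illustration of the warping step, the sketch below applies a dense 2-D displacement field to a reference mask with PyTorch's grid_sample, mimicking the role of $f_{STN}$; the displacement field is assumed to be given in pixel units by a registration model such as HyperMorph, whose own interface is not reproduced here.

```python
import torch
import torch.nn.functional as F

def warp_mask(mask, disp):
    """Warp a reference mask with a dense displacement field (STN-style resampling).

    mask: (1, 1, H, W) binary mask of the best-matching in-context sample
    disp: (1, 2, H, W) displacement field in pixels (dx, dy), e.g. produced by a
          registration model such as HyperMorph (its API is not shown here)
    """
    _, _, H, W = mask.shape
    # Base sampling grid in the normalized [-1, 1] coordinates expected by grid_sample.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0)        # (1, H, W, 2)
    # Convert pixel displacements to normalized offsets and add them to the base grid.
    offset = torch.stack(
        (disp[:, 0] * 2.0 / max(W - 1, 1), disp[:, 1] * 2.0 / max(H - 1, 1)), dim=-1
    )                                                        # (1, H, W, 2)
    grid = base + offset
    # Nearest-neighbour sampling keeps the warped mask binary.
    return F.grid_sample(mask.float(), grid, mode="nearest", align_corners=True)
```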

3.3. Visual In-Context Guidance for SAM-Based Segmentation

Multi-granularity Prompt Generation. After obtaining the in-context candidates from the retrieval stage, our goal is to effectively use their prior knowledge to enhance the performance of the subsequent segmentation process. As shown in Figure 4, to achieve this, we first predict an initial intermediate mask for the query image and then construct multi-granularity visual prompts to guide SAM in producing more robust segmentation predictions. Specifically, the in-context candidates $\mathcal{S}$ and the query image $I_q$ are fed into a coarse segmentation model with in-context learning capabilities to predict the initial intermediate mask. For the choice of the coarse model, we follow ICL-SAM [36] and utilize UniverSeg [3], which features a UNet architecture. Subsequently, to generate more precise and comprehensive visual prompts, we feed the intermediate mask into a visual-prompt construction module to obtain three types of visual prompts tailored for SAM: points, boxes, and masks. First, to minimize noise interference, we preprocess the intermediate mask using morphological operations [39]: we apply erosion to shrink the edges of the intermediate mask and an opening operation to remove noise and irrelevant regions. The preprocessed intermediate mask serves as the mask-level visual prompt. Next, we employ a minimum bounding box generation algorithm [40] to produce the smallest enclosing rectangle of the preprocessed mask, which acts as the box-level prompt. Finally, we randomly sample five key points from the mask to serve as the point-level prompt. By combining these three types of visual prompts, we provide rich guiding conditions for SAM's decoding process. The multi-granularity prompt set $p$ is then formulated as:
$$p = \{\, M_{refined},\ f_{minbox}(M_{refined}),\ f_{sample}(M_{refined}, k=5) \,\},$$
$$\hat{y}_{mid} = f_{seg}(\mathcal{S}, I_q),$$
$$M_{refined} = f_{morph}(\hat{y}_{mid}).$$
Here, $M_{refined}$ refers to the refined intermediate mask obtained through the morphological operation $f_{morph}$, $\hat{y}_{mid}$ is the initial intermediate mask generated by the coarse segmentation model $f_{seg}$, $f_{minbox}$ denotes the minimum bounding box generation algorithm, and $f_{sample}(\cdot, k)$ is a function that performs random sampling $k$ times.
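A minimal sketch of this prompt construction with OpenCV and NumPy is given below, assuming the intermediate mask is a binary uint8 array; the 3×3 structuring element, the axis-aligned bounding box, and the five sampled foreground points follow the description above, while the helper name and the empty-mask fallback are illustrative assumptions.

```python
import cv2
import numpy as np

def build_prompts(intermediate_mask, num_points=5, seed=0):
    """Build mask-, box-, and point-level prompts from a coarse intermediate mask.

    intermediate_mask: (H, W) uint8 array with values in {0, 1}
    Returns (mask_prompt, box_prompt, point_prompt).
    """
    kernel = np.ones((3, 3), np.uint8)                           # 3x3 square structuring element
    refined = cv2.erode(intermediate_mask, kernel)               # shrink mask edges
    refined = cv2.morphologyEx(refined, cv2.MORPH_OPEN, kernel)  # remove small noisy regions

    ys, xs = np.nonzero(refined)
    if len(xs) == 0:                                             # fall back to the raw mask if empty
        refined, (ys, xs) = intermediate_mask, np.nonzero(intermediate_mask)
    # Axis-aligned minimum bounding box: (x_min, y_min, x_max, y_max).
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])

    rng = np.random.default_rng(seed)                            # deterministic seed for reproducibility
    idx = rng.choice(len(xs), size=min(num_points, len(xs)), replace=False)
    points = np.stack([xs[idx], ys[idx]], axis=1)                # (k, 2) foreground points (x, y)

    return refined, box, points
```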
Adaptive Fusion for Medical Segmentation. The proposed adaptive fusion module is designed to leverage multi-granularity visual prompts and semantic signals to guide the SAM, which possesses strong segmentation capabilities, to achieve robust segmentation performance even in unseen domains. However, efficiently integrating visual prompts and semantic signals into the SAM is not a trivial task. To address this, we first input the visual prompts into the SAM encoder to provide spatial guidance. Subsequently, during the mask decoding phase, we transform the semantic signals M g , which contain prior information, into an attention bias matrix B. This matrix is then input into the final attention layer of the mask decoder, thereby enhancing the SAM’s perception of the target regions.
Formally, given the extracted semantic signal $M_g$, we first apply a down-sampling operation to it. The result is denoted as $\hat{M} = f_{down}(M_g)$, where $f_{down}$ represents the down-sampling operation and $\hat{M}$ is the down-sampled semantic signal. Next, we flatten $\hat{M}$ into a 1-D vector $M_{fla} = f_{fla}(\hat{M})$, where $f_{fla}$ denotes the flattening operation. We then transform $M_{fla}$ into a 2-D attention bias matrix $B$. Each element $B_{i,j}$ of $B$ is computed as:
$$B_{i,j} = \gamma\,(m_i + m_j), \quad B_{i,j} \in B,\ m_i \in M_{fla},\ m_j \in M_{fla},$$
where $\gamma$ is a scaling factor that controls the influence of the semantic signal in the attention mechanism. The idea behind constructing the attention bias matrix $B$ is as follows: if $m_i = m_j = 1$, indicating that both positions belong to the foreground region, the corresponding positive bias $B_{i,j}$ enhances the perception of the foreground target region; if $m_i = m_j = -1$, indicating that both positions belong to the background noise region, the corresponding negative bias $B_{i,j}$ suppresses background noise; if $m_i \neq m_j$, indicating that one position belongs to the foreground and the other to the background, then $B_{i,j} = 0$ and no influence is applied. Finally, we modify the computation of the last attention layer in the SAM decoder to efficiently incorporate the attention bias matrix $B$:
$$\mathrm{Score} = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right),$$
where $\mathrm{Score}$ represents the attention scores, and $Q$ and $K$ are the standard query and key vectors. This modification enables SAM to focus more effectively on foreground features, even when the bounding box is inaccurate, thereby improving overall segmentation accuracy. By applying the modified attention layer to the SAM decoder $\mathrm{Dec}$, we further obtain a refined segmentation mask $\hat{y}_{SAM} = \mathrm{Dec}(I_q, p)$, where $I_q$ is the query image and $p$ represents the multi-granularity visual prompt set.
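The following sketch illustrates how the semantic signal can be turned into an additive bias and injected into a standard scaled dot-product attention. It assumes that $M_g$ has been recoded so that foreground pixels are $+1$ and background pixels are $-1$ (consistent with the sign behavior described above); the helper names and the use of adaptive average pooling for $f_{down}$ are assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def attention_bias(semantic_mask, token_hw, gamma=0.2):
    """Turn the semantic guidance mask into an additive attention bias matrix B.

    semantic_mask: (1, 1, H, W) mask M_g with foreground ~ +1 and background ~ -1
    token_hw:      spatial size (h, w) of the decoder's image tokens
    """
    m = F.adaptive_avg_pool2d(semantic_mask.float(), token_hw)  # down-sample to the token grid
    m = m.flatten(1)                                            # (1, h*w) flattened signal M_fla
    # B[i, j] = gamma * (m_i + m_j): positive for fg-fg pairs, negative for bg-bg, ~0 for mixed.
    return gamma * (m.unsqueeze(2) + m.unsqueeze(1))            # (1, h*w, h*w)

def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with the bias added before the softmax."""
    d = q.shape[-1]
    scores = torch.softmax(q @ k.transpose(-2, -1) / d**0.5 + bias, dim=-1)
    return scores @ v
```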
After obtaining the coarse intermediate mask y ^ m i d and the refined segmentation mask y ^ S A M from the SAM decoder, we adopt a mask fusion strategy to integrate the prior knowledge from the intermediate mask and the accurate SAM prediction. The fusion process can be expressed as:
$$\hat{y}_q = \beta\,\hat{y}_{mid} + (1 - \beta)\,\hat{y}_{SAM},$$
where $\hat{y}_q$ denotes the final mask prediction, and $\beta$ balances the weights of the two masks for better fusion results.
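A minimal sketch of the fusion step is shown below, assuming both masks are soft foreground probabilities in $[0, 1]$; the final 0.5 binarization threshold is an assumption and not specified in the text.

```python
import torch

def fuse_masks(y_mid, y_sam, beta=0.4):
    """Weighted fusion of the coarse intermediate mask and SAM's refined mask."""
    prob = beta * y_mid + (1.0 - beta) * y_sam   # soft probabilities in [0, 1]
    return (prob > 0.5).float()                  # binarize; threshold is an assumed choice
```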

4. Experimental Results and Discussions

4.1. Experimental Settings

Evaluation dataset. To evaluate our VG-SAM in universal medical image segmentation, we use three widely adopted medical image segmentation datasets: REFUGE [41], BraTS21 [42], and KiTS23 [43]. To clearly present the characteristics of these datasets, we summarize them in Table 1 according to four aspects: segmentation task, imaging modality, target object, and data scale. These datasets are designed for different research fields and exhibit significant differences in imaging protocols, target organs, lesion types, and segmentation complexity. We select these diverse datasets to simulate the real-world domain differences encountered in medical image analysis, aiming to validate the effectiveness and generalization capability of the proposed VG-SAM when faced with varying data domains. We provide detailed statistics for these three datasets below.
The REFUGE (Retinal Fundus Glaucoma Challenge) dataset utilizes fundus images (i.e., color fundus photographs) as its primary imaging modality. This dataset focuses on the precise segmentation of key objects for glaucoma diagnosis: the Optic Disc (OD) and Optic Cup (OC). REFUGE comprises a total of 1200 color fundus images, evenly divided into a training set, an offline validation set, and an online test set, with each subset consisting of 400 images. REFUGE is the first publicly available large-scale fundus image dataset specifically dedicated to glaucoma assessment. It provides precise pixel-level annotations of OD/OC and clinical glaucoma diagnosis labels. In our experiments, the REFUGE dataset is used to evaluate the model’s adaptability and performance in ophthalmic fields.
The BraTS21 (Brain Tumor Segmentation Challenge 2021) dataset focuses on brain tumor segmentation task, specifically gliomas. It utilizes multi-parametric MRI (mpMRI) imaging modalities, with each case typically including four sequences: T1-weighted (T1), contrast-enhanced T1-weighted (T1ce), T2-weighted (T2), and T2 Fluid-Attenuated Inversion Recovery (FLAIR). These different sequences provide complementary information about brain tissue and tumor heterogeneity, which is crucial for accurately identifying and segmenting various tumor subregions. The BraTS21 dataset consists of 2000 cases, divided into 1251 for training, 219 for validation, and 530 for testing. It is worth emphasizing that the BraTS21 dataset is derived from real-world clinical practices across multiple medical institutions, including real mpMRI scans of glioma patients and detailed annotations by neuroradiology experts. The data heterogeneity introduced by scanning equipment and protocols across institutions requires segmentation models to exhibit strong robustness. Specifically, these models must have the generalization ability to effectively adapt to various clinical environments represented in the BraTS21 dataset. Consequently, the BraTS21 dataset is employed to evaluate whether the proposed VG-SAM can effectively address domain shift challenges.
The KiTS23 dataset is composed of CT modalities and includes 599 cases, with 489 cases for training and 110 for testing. KiTS23 provides precise semantic segmentation annotations for three categories: kidneys, kidney tumors, and kidney cysts. This dataset presents several generalization challenges, including the distinct imaging characteristics between kidneys and tumors, the potential coexistence of tumors and cysts, and the wide variance in tumor sizes across different cases. These features make KiTS23 an ideal benchmark dataset for evaluating the generalization capabilities of medical image segmentation models.
Evaluation metric. We use the widely used Dice Similarity Coefficient (DSC) [2,20,44] as an evaluation metric to evaluate the performance of medical segmentation methods. The DSC measures the degree of overlap between the predicted segmentation results and the ground truth annotations. The computation of DSC is defined as follows:
$$\mathrm{DSC} = \frac{2\,|A \cap B|}{|A| + |B|} \cdot 100\%,$$
where $A$ represents the predicted mask, $B$ denotes the ground truth mask, and $|\cdot|$ denotes the number of pixels in a region (set cardinality). The DSC ranges from 0% to 100%, where 100% indicates perfect overlap between the predicted segmentation and the ground truth, while 0% signifies no overlap at all. By incorporating both the predicted and the ground truth masks in the denominator, DSC is particularly effective in handling medical images with varying volumes or shapes. In medical image analysis, DSC is commonly used to evaluate the segmentation accuracy of organs, tumors, and lesions in CT and MRI.
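For reference, a straightforward NumPy implementation of the DSC defined above, assuming binary masks of identical shape:

```python
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """Dice Similarity Coefficient (in %) between two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    # eps guards against division by zero when both masks are empty.
    return 2.0 * inter / (pred.sum() + gt.sum() + eps) * 100.0
```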
Compared methods. We compare our VG-SAM with state-of-the-art universal image segmentation methods, including SegGPT [35], UniverSeg [3], Neuralizer [10], and ICL-SAM [36]. Specifically, SegGPT is an in-context segmentation method for general images, while UniverSeg, Neuralizer, and ICL-SAM are universal segmentation methods tailored for medical images. To ensure a fair comparison, all models are evaluated directly on the standard test sets without any additional re-training.
Implementation details. We implement our VG-SAM in the PyTorch 1.13 deep learning framework. For the proposed multi-scale in-context retrieval strategy, we utilize image features from the 4th, 8th, and 12th layers of the encoder for similarity computation, with weight factors of 0.2, 0.4, and 0.4, respectively. The scaling factor γ for the semantic signal is set to 0.2. Regarding the mask fusion strategy, we set the fusion weight β to 0.4. After conducting adaptation ablation studies on various SAM architectures, we ultimately select SEG-SAM [45] as the base segmentation model for our framework. During testing, our VG-SAM model runs on a single RTX 4090 GPU.
We employ a square structuring element with a kernel size of 3×3 for morphological operations. Our axis-aligned bounding box generation algorithm does not require any user-defined parameters. Specifically, this deterministic algorithm finds the minimum and maximum coordinates along the x and y axes for all non-zero pixels within the preprocessed mask. The result is an upright rectangle, whose sides are parallel to the image axes, that tightly encloses the segmentation target. For the point sampling strategy, we use deterministic seeds to ensure the reproducibility of our experiments. We confirm that our experiments across different datasets follow the above parameter configurations.
In our experiments, we follow standard preprocessing procedures, consistent with previous methods [3,36]. Specifically: (1) For 3D datasets (e.g., CT and MRI), the intensity values of each volume are normalized to the range [0, 255]. The 3D data are then split into 2D slice images along the x, y, and z axes. (2) For 2D datasets (e.g., Fundus), we ensure that pixel values are within the range [0, 255]. The preprocessed images are saved in PNG format at three different resolutions: 240×240, 512×512, and 1634×1634. We confirm that during testing, all query images and selected in-context samples are sourced from different patients, eyes, or cases to ensure a fair comparison.
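The sketch below illustrates the preprocessing described above for 3-D volumes, i.e., intensity normalization to [0, 255] followed by slicing along the three axes; min-max rescaling is assumed for the normalization, since the exact scheme is not specified.

```python
import numpy as np

def normalize_volume(volume):
    """Linearly rescale a 3-D volume's intensities to [0, 255] (uint8); min-max scaling assumed."""
    v = volume.astype(np.float32)
    v = (v - v.min()) / max(float(v.max() - v.min()), 1e-8) * 255.0
    return v.astype(np.uint8)

def volume_to_slices(volume):
    """Split a normalized 3-D volume into 2-D slices along the x, y, and z axes."""
    slices = []
    for axis in range(3):
        slices.extend(np.moveaxis(volume, axis, 0))   # each element is one 2-D slice
    return slices
```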

4.2. Main Results

Quantitative results. Table 2 presents the quantitative comparisons of our VG-SAM against competitors across different datasets. From these results, we can observe that our VG-SAM consistently outperforms all state-of-the-art methods on the KiTS23, BraTS21, and REFUGE datasets, with varying numbers of in-context samples. Notably, under the most challenging one-shot reference setting, our method achieves a 5.19% increase (73.91 → 77.75) in DSC on the REFUGE dataset compared to the second-best method, ICL-SAM. On the BraTS21 dataset, it outperforms ICL-SAM with a notable 9.01% improvement (33.87 → 36.92) in DSC. On the KiTS23 dataset, VG-SAM improves DSC by 5.63% (64.55 → 68.19). These experimental results demonstrate the superior segmentation performance of our VG-SAM across different data domains.
Qualitative comparisons. We further provide qualitative results to intuitively explain the superiority of our VG-SAM. The qualitative visualizations in Figure 5 strongly demonstrate the superiority of our approach against UniverSeg [3] and ICL-SAM [36] across various scenarios. In summary, we observe that our method consistently produces accurate, complete, and finely detailed segmentation results both when dealing with structures featuring clear boundaries and in challenging scenarios such as blurry edges, small targets, or significant background interference. This capability is likely attributed to our in-context filtering strategy and the multi-granularity visual prompt construction mechanism, which guide SAM to better understand the target objects and capture its fine features.
More specifically, Figure 5 shows that our VG-SAM can distinctly segment each organ with clear boundaries that closely align with the ground truth masks. In contrast, the results from UniverSeg and ICL-SAM exhibit some degree of overlap, with blurred boundaries and less precise shape restoration. While other methods can also locate the rough target area, they struggle to capture internal details and boundary smoothness. Overall, the visual results in Figure 5 indicate that our VG-SAM surpasses existing methods in complex target localization and boundary refinement, demonstrating superior segmentation performance.

4.3. Analyses and Discussions

In this section, we present ablation studies on the REFUGE, BraTS21, and KiTS23 datasets to thoroughly analyze our VG-SAM and the actual contributions of each component. For evaluation, we use the DSC metric, with a default of 16 in-context samples.
Analysis of in-context selection. To explore the impact of the in-context selection strategy, we construct a variant of VG-SAM in which our retrieved optimal in-context set is replaced with randomly selected samples. To ensure a fair comparison, we randomly sample five times and average the results. From Table 3, it is observed that, compared to this variant, our VG-SAM increases the DSC by 3.69% (82.87 → 85.93), 2.66% (72.13 → 74.05), and 5.63% (84.93 → 89.72) on the REFUGE, BraTS21, and KiTS23 datasets, respectively. This demonstrates that our selection strategy effectively enhances segmentation performance. This improvement is attributed to two main factors: (1) the original in-context set often contains obvious noise, and (2) some images differ greatly from the query image in terms of shape, texture, and mask size. These issues can interfere with the model's segmentation. By introducing our in-context selection strategy, the model is able to retrieve in-context images that are more similar to the query image, thereby extracting more relevant in-context information and achieving better segmentation results.
Discussions on semantic guidance. We present ablation experiments in Table 3 to validate the effectiveness of the semantic guidance signals. As shown in Table 3, VG-SAM achieves superior segmentation performance, with the DSC improving by 4.43% (82.28 → 85.93), 5.08% (70.47 → 74.05), and 6.15% (84.52 → 89.72) on the three datasets, respectively, compared to the row without semantic guidance. These results demonstrate that the extracted semantic guidance signals effectively assist SAM in focusing more accurately on the foreground regions of the query images. This is because the semantic guidance signals are derived from high-quality in-context samples, which encourage the model to extract more critical features that are highly semantically relevant to the query images, thus enhancing SAM's segmentation capability.
Effect of multi-granularity prompt. We conduct ablation experiments to evaluate the impact of multi-granularity visual prompts. Specifically, we first remove the visual-prompt construction module from VG-SAM. To meet SAM's native prompt requirements, we directly use the bounding boxes of the ground truth as prompts and input them into the SAM encoder. By comparing the results of this variant with those of VG-SAM, we analyze the differences between using multi-granularity prompts and simple bounding box prompts. As shown in Table 3, the multi-granularity visual prompts, which incorporate richer visual cues such as points, boxes, and masks, achieve superior segmentation performance compared to single bounding boxes. This improvement is reflected in increases of 2.53% (83.81 → 85.93), 2.54% (72.21 → 74.05), and 3.88% (86.37 → 89.72) in the DSC metric across the three datasets, respectively.
Evaluation of mask fusion. As shown in Table 3, removing mask fusion and relying solely on the masks predicted by SAM leads to a slight performance decline, with decreases of 1.75% (85.93 → 84.42), 2.26% (74.05 → 72.38), and 2.94% (89.72 → 87.08) across the three datasets, respectively. This is because SAM occasionally tends to over-segment certain target objects, as highlighted by red circles in Figure 5. While the initial intermediate masks may have imprecise boundary delineations, they effectively capture the general area of interest. Therefore, by fusing the intermediate masks with SAM's finely predicted masks, our VG-SAM can achieve complementary effects, leading to more precise segmentation results, especially when dealing with complex medical objects.
Ablation on SAM architecture. We conduct ablation experiments to evaluate the effect of different SAM architectures. We evaluate three medical SAM variants (MedSAM [29], SAM-Med2D [30], and SEG-SAM [45]) alongside the original SAM [13]. The experimental results in Table 4 indicate that the original SAM, despite not being fine-tuned on medical datasets, still achieves competitive segmentation performance. Furthermore, all three medical SAM variants deliver commendable segmentation results. These findings demonstrate that the proposed method effectively adapts to various SAM architectures and exhibits strong adaptability.
Ablation studies on the HyperMorph registration. We conduct ablation studies to quantify the impact of registration. First, we re-implement VG-SAM by removing the HyperMorph registration from the semantic signal construction. Subsequently, we perform preference experiments to quantify how often registration improves versus harms performance. Specifically, we compare the predictions of VG-SAM with and without registration on the KiTS23 dataset, counting the number of times each version achieves a better DSC.
As shown in Table 5, the quantitative results indicate that VG-SAM with HyperMorph registration achieves a preference score of 78.05%, highlighting that registration enhances performance in the majority of cases. This is due to the high-quality in-context samples selected during the retrieval phase, which exhibit strong similarity to the query samples. Consequently, the registration-generated semantic signal provides accurate foreground positional information in most cases. We further compare their prediction performance in terms of DSC in Table 5. The results show that VG-SAM with registration improves DSC by 3.62% (86.58 → 89.72) compared to the version without registration. This further highlights the effectiveness of the HyperMorph registration.
We present visual failure cases for a more intuitive analysis of the registration. As shown in Figure 6, although the final predictions of VG-SAM become incomplete in these cases, they are not significantly degraded, for two key reasons: (1) Our visual prompt guidance is derived from the coarse segmentation mask, which provides rough location information for the target regions and helps correct SAM's predictions. (2) Our mask fusion strategy combines the coarse segmentation mask with SAM's predictions, resulting in a more robust final output.
Analysis of computational efficiency. We compare the inference times of our method with those of competitors, including UniverSeg, which utilizes a traditional UNet architecture, and ICL-SAM, which is based on an advanced SAM architecture. As shown in Table 6, due to the low parameter count of the UNet-based design, UniverSeg achieves the lowest inference cost in terms of both time and memory. However, its segmentation performance significantly lags behind SAM-based methods in terms of DSC. Notably, our VG-SAM demonstrates similar inference overhead to ICL-SAM, but improves DSC by 3.97% (86.29 → 89.72), further highlighting the superiority of our method.
Our inference involves a retrieval stage and a segmentation stage; we detail their computational efficiency below. The retrieval stage is highly efficient because it operates on compact features rather than raw images. Before retrieval, all samples in the retrieval pool are pre-encoded to extract sample features. Note that this pre-encoding is performed only once, allowing subsequent retrieval to reuse the pre-stored features. To handle images of varying resolutions, we apply adaptive pooling to the features output by the SAM encoder, resulting in a fixed 1024-dimensional vector. For instance, when processing a query sample with a retrieval pool containing 10,147 samples from the KiTS23 dataset, our method selects the Top-16 in-context samples based on their similarity scores. This retrieval process takes only 0.102 s, with a GPU memory consumption of 3291 MB. These results highlight the efficiency of our retrieval stage. During segmentation, our method handles a single sample in 0.095 s with a GPU memory footprint of 4115 MB, as reported in Table 6. This indicates that our model can be deployed on a single consumer-grade GPU, showcasing its potential for practical applications.
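As an illustration of the pre-encoding step, the following sketch pools encoder feature maps into fixed-length vectors that can be cached once and reused for retrieval; the assumption that the encoder returns a (B, C, H, W) feature map and the 1-D adaptive pooling used to reach the target dimension are illustrative choices, not the exact implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def encode_pool(encoder, images, dim=1024):
    """Pre-encode the retrieval pool once into fixed-length feature vectors.

    encoder: a frozen image encoder (e.g., the SAM encoder); a (B, C, H, W)
             feature-map output is assumed here
    images:  (B, 3, H, W) batch of pre-processed images
    """
    feats = encoder(images)                              # (B, C, h, w) feature maps
    pooled = F.adaptive_avg_pool2d(feats, 1).flatten(1)  # (B, C) channel descriptors
    # Pool to the fixed retrieval dimension if C != dim (illustrative choice).
    if pooled.shape[1] != dim:
        pooled = F.adaptive_avg_pool1d(pooled.unsqueeze(1), dim).squeeze(1)
    return pooled                                        # cached once and reused for retrieval
```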
Moreover, we investigate the impact of segmentation cost across various image resolutions. We report the segmentation cost for three resolutions corresponding to the BraTS21, KiTS23, and REFUGE datasets in Table 7. It can be observed that our method processes a single sample at three different resolutions in 0.071, 0.095, and 0.406 s, respectively, demonstrating the high inference efficiency of our model. The corresponding GPU memory usage is 4105, 4115, and 4133 MB, respectively.
To provide a comprehensive analysis of the trade-off between computational efficiency and model performance, we have included a speed-accuracy Pareto plot in Figure 7. Specifically, we set the number of in-context samples to 1, 2, 4, 8, and 16, and report the corresponding DSC performance and the average inference time per sample on the KiTS23 dataset. As shown in Figure 7, DSC improves as the number of in-context samples increases, while the inference time also grows accordingly. However, the inference time remains within 0.2 s across all configurations. Notably, when the number of in-context samples is less than 4, increasing the sample count leads to a significant performance boost. Beyond 4 samples, the performance improvement becomes more gradual.
Adaptability of our method. We have already demonstrated the strong performance of our method across three medical image modalities, including MRI, CT, and fundus (cf. Table 2). To further evaluate the adaptability of our method, we conduct experiments on three additional modalities: PET, X-ray, and 3D Ultrasound, using the Med2D-20M dataset [6]. As shown in Figure 8, our VG-SAM demonstrates satisfactory segmentation performance across these diverse modalities, accurately predicting anatomical structures of varying sizes. These results further validate the universality of our method. We credit this success to our seamless integration of in-context learning with SAM, which enables robust predictions on diverse medical image modalities.
Model interpretability analysis. To improve the model interpretability, we conduct a visualization analysis of the prediction results. Specifically, we extract the attention masks from the final token-to-image attention blocks of SAM’s decoder and overlay them onto the original query images to generate attention maps. The visualization results are provided in Figure 9. In the attention maps, a color gradient ranging from light blue to red is used to represent the model’s confidence levels, where red indicates higher confidence and light blue corresponds to lower confidence. These visualizations provide neurologists with richer information for understanding lesion areas compared to the final prediction outputs, thereby facilitating more precise clinical decision-making.
Discussions on potential clinical applications. Our method demonstrates strong generalization performance across various medical imaging modalities, thanks to the seamless integration of in-context learning and SAM, without requiring additional model retraining. In contrast to previous task-specific models, which often demand large amounts of annotated data and costly fine-tuning for new tasks, our approach achieves robust performance on unseen tasks by simply leveraging a few reference examples.
In clinical practice, particularly in ophthalmology, subspecialties such as retina, anterior segment, and strabismus often face challenges in timely diagnosis due to the limited availability of experienced specialists [46]. With the aid of exemplar-guided reasoning tools like ours, clinicians who do not routinely specialize in strabismus or lack advanced training in its evaluation could significantly benefit from assistive technologies that enhance diagnostic accuracy. For instance, in-context learning techniques have been shown to substantially improve the strabismus classification accuracy of GPT-4o [46].
Failure case analysis of our method. We conduct experiments on a highly challenging histopathology dataset [47], which contains microscopic histology structures that differ significantly from the macroscopic anatomical structures commonly seen in CT and MRI datasets. As shown in Figure 10, our model fails to predict accurate target masks. This issue arises because the SAM we use encounters very little histopathology data during its large-scale pretraining, leading to poor generalization to this specific modality. Compared to popular modalities such as CT and MRI, publicly available histopathology datasets typically contain only a limited number of samples [48,49,50], primarily due to the strong privacy constraints associated with histopathology data. The scarcity of histopathology samples poses a significant challenge, as it is insufficient to support the extensive pretraining required by SAM.
A potential solution to this challenge is to replace the current SAM with one that performs better on histopathology datasets. This requires collaboration with hospitals to collect a substantial amount of histopathology data and fine-tune SAM accordingly. We anticipate the development of such a SAM that better supports the complex task of pathological segmentation.

5. Conclusions

In this paper, we propose a novel VG-SAM algorithm for universal medical segmentation. Our VG-SAM seamlessly integrates visual in-context learning into SAM, aiming to enhance its adaptability and robustness across various medical segmentation scenarios conditioned on a few in-context samples. The proposed VG-SAM comprises a multi-scale in-context retrieval phase and a visual in-context guided segmentation phase. To ensure the quality of the in-context samples, we introduce a multi-level feature similarity strategy in the retrieval phase to select an in-context set that is closely related to the query image. During the segmentation phase, multi-granularity visual prompts generated by the visual-prompt construction module, along with the retrieved in-context set, are fed into an adaptive fusion module. This module efficiently guides SAM to enhance its perception of the target objects, ultimately producing high-precision segmentation predictions.
VG-SAM demonstrates superior performance across various segmentation tasks, outperforming state-of-the-art methods in multiple medical image modalities, including MRI, CT, and fundus. In our experiments, VG-SAM achieves DSC scores of 85.93 % , 74.05 % , and 89.72 % on the REFUGE, BraTS21, and KiTS23 datasets, respectively, without requiring any adaptation costs. Notably, compared to the second-best method, VG-SAM achieves an average improvement of 6.61 % in DSC across these three datasets under the challenging one-shot setting, which underscores its robustness and efficiency.
Beyond quantitative results, as shown in Figure 5, VG-SAM consistently delivers accurate segmentations in challenging scenarios, including cases with blurry edges, small targets, or heavy background interference. Furthermore, the visualization results in Figure 8 demonstrate that our VG-SAM achieves desirable segmentation predictions even in rare image modalities, such as PET, 3D ultrasound, and X-ray. This highlights the universality of our method in handling a wide range of medical segmentation tasks.
In the future, we plan to explore methods to enhance the interpretability of in-context guided predictions, enabling users to understand how the model utilizes “in-context samples” for predictions, a factor crucial for building clinical trust. Moreover, integrating in-context learning with human interaction, allowing clinicians to dynamically provide or adjust in-context samples and visual prompts, could further improve the accuracy and flexibility of segmentation.

Author Contributions

Methodology, writing—original draft preparation, G.D.; data curation, investigation, Q.W.; validation, visualization, Y.Q.; supervision, resources, G.W.; writing—review and editing, supervision, funding acquisition, S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Key Research and Development Program of China (No.2023YFC3502900), the National Natural Science Foundation of China (Nos.62176093, 61673182 and 62576139), and the Guangdong Emergency Management Science and Technology Program (No.2025YJKY001).

Data Availability Statement

REFUGE dataset: https://refuge.grand-challenge.org/; BraTS21 dataset: https://www.synapse.org/Synapse:syn25829067/wiki/610863; KiTS23 dataset: https://kits-challenge.org/kits23/ (accessed on 13 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shao, Y.; Yang, J.; Zhou, W.; Sun, H.; Gao, Q. Fractal-Inspired Region-Weighted Optimization and Enhanced MobileNet for Medical Image Classification. Fractal Fract. 2025, 9, 511. [Google Scholar] [CrossRef]
  2. Isensee, F.; Jaeger, P.F.; Kohl, S.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef] [PubMed]
  3. Butoi, V.I.; Ortiz, J.J.G.; Ma, T.; Sabuncu, M.R.; Guttag, J.; Dalca, A.V. Universeg: Universal medical image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 21438–21451. [Google Scholar]
  4. Chen, H.; Cai, Y.; Wang, C.; Chen, L.; Zhang, B.; Han, H.; Guo, Y.; Ding, H.; Zhang, Q. Multi-organ foundation model for universal ultrasound image segmentation with task prompt and anatomical prior. IEEE Trans. Med. Imaging 2024, 44, 1005–1018. [Google Scholar] [CrossRef]
  5. Chen, Z.; Xu, Q.; Liu, X.; Yuan, Y. Un-sam: Universal prompt-free segmentation for generalized nuclei images. arXiv 2024, arXiv:2402.16663. [Google Scholar]
  6. Ye, J.; Cheng, J.; Chen, J.; Deng, Z.; Li, T.; Wang, H.; Su, Y.; Huang, Z.; Chen, J.; Jiang, L.; et al. Sa-med2d-20m dataset: Segment anything in 2d medical imaging with 20 million masks. arXiv 2023, arXiv:2311.11969. [Google Scholar]
  7. Guan, H.; Liu, M. Domain adaptation for medical image analysis: A survey. IEEE Trans. Biomed. Eng. 2021, 69, 1173–1185. [Google Scholar] [CrossRef] [PubMed]
  8. Ghamsarian, N.; Gamazo Tejero, J.; Márquez-Neila, P.; Wolf, S.; Zinkernagel, M.; Schoeffmann, K.; Sznitman, R. Domain adaptation for medical image segmentation using transformation-invariant self-training. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; pp. 331–341. [Google Scholar]
  9. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  10. Czolbe, S.; Dalca, A.V. Neuralizer: General neuroimage analysis without re-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6217–6230. [Google Scholar]
  11. Gao, J.; Lao, Q.; Kang, Q.; Liu, P.; Du, C.; Li, K.; Zhang, L. Boosting your context by dual similarity checkup for In-Context learning medical image segmentation. IEEE Trans. Med. Imaging 2024, 44, 310–319. [Google Scholar] [CrossRef]
  12. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  13. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  16. Zhou, Z.; Siddiquee, M.; Tajbakhsh, N.; Liang, J. UNet++: A nested U-Net architecture for medical image segmentation. arXiv 2018, arXiv:1807.10165. [Google Scholar] [CrossRef]
  17. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar] [CrossRef]
  18. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Athens, Greece, 17–21 October 2016; pp. 424–432. [Google Scholar]
  19. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  20. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 205–218. [Google Scholar]
  21. Hatamizadeh, A.; Yang, D.; Roth, H.; Xu, D. UNETR: Transformers for 3D medical image segmentation. arXiv 2021, arXiv:2103.10504. [Google Scholar]
  22. Wasserthal, J.; Breit, H.C.; Meyer, M.T.; Pradella, M.; Hinck, D.; Sauter, A.W.; Heye, T.; Boll, D.T.; Cyriac, J.; Yang, S.; et al. TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiol. Artif. Intell. 2023, 5, e230024. [Google Scholar] [CrossRef]
  23. Ouyang, C.; Biffi, C.; Chen, C.; Kart, T.; Qiu, H.; Rueckert, D. Self-supervised learning for few-shot medical image segmentation. IEEE Trans. Med. Imaging 2022, 41, 1837–1848. [Google Scholar] [CrossRef]
  24. Wu, J.; Ji, W.; Fu, H.; Xu, M.; Jin, Y.; Xu, Y. Medsegdiff-v2: Diffusion-based medical image segmentation with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; pp. 6030–6038. [Google Scholar]
  25. Hu, X.; Xu, X.; Shi, Y. How to efficiently adapt large segmentation model (sam) to medical images. arXiv 2023, arXiv:2306.13731. [Google Scholar] [CrossRef]
  26. Cheng, Z.; Wei, Q.; Zhu, H.; Wang, Y.; Qu, L.; Shao, W.; Zhou, Y. Unleashing the potential of sam for medical adaptation via hierarchical decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3511–3522. [Google Scholar]
  27. Chen, T.; Zhu, L.; Deng, C.; Cao, R.; Wang, Y.; Zhang, S.; Li, Z.; Sun, L.; Zang, Y.; Mao, P. Sam-adapter: Adapting segment anything in underperformed scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 3367–3375. [Google Scholar]
  28. Killeen, B.D.; Wang, L.J.; Zhang, H.; Armand, M.; Taylor, R.H.; Dreizin, D.; Osgood, G.; Unberath, M. Fluorosam: A language-aligned foundation model for X-ray image segmentation. arXiv 2024, arXiv:2403.08059. [Google Scholar]
  29. Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2024, 15, 654. [Google Scholar] [CrossRef]
  30. Cheng, J.; Ye, J.; Deng, Z.; Chen, J.; Li, T.; Wang, H.; Su, Y.; Huang, Z.; Chen, J.; Jiang, L.; et al. SAM-Med2D. arXiv 2023, arXiv:2308.16184. [Google Scholar] [CrossRef]
  31. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 2790–2799. [Google Scholar]
  32. Zhang, K.; Liu, D. Customized segment anything model for medical image segmentation. arXiv 2023, arXiv:2304.13785. [Google Scholar] [CrossRef]
  33. Wu, J.; Wang, Z.; Hong, M.; Ji, W.; Fu, H.; Xu, Y.; Xu, M.; Jin, Y. Medical sam adapter: Adapting segment anything model for medical image segmentation. Med. Image Anal. 2025, 102, 103547. [Google Scholar] [CrossRef]
  34. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
  35. Wang, X.; Zhang, X.; Cao, Y.; Wang, W.; Shen, C.; Huang, T. Seggpt: Segmenting everything in context. arXiv 2023, arXiv:2304.03284. [Google Scholar] [CrossRef]
  36. Hu, J.; Shang, Y.; Yang, Y.; Guo, X.; Peng, H.; Ma, T. Icl-sam: Synergizing in-context learning model and sam in medical image segmentation. Med. Imaging Deep. Learn. 2024, 250, 641–656. [Google Scholar]
  37. Hoopes, A.; Hoffmann, M.; Fischl, B.; Guttag, J.; Dalca, A.V. Hypermorph: Amortized hyperparameter learning for image registration. In Proceedings of the International Conference on Information Processing in Medical Imaging, Virtual Event, 28–30 June 2021; pp. 3–17. [Google Scholar]
  38. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  39. Haralick, R.M.; Sternberg, S.R.; Zhuang, X. Image analysis using mathematical morphology. IEEE Trans. Pattern Anal. Mach. Intell. 1987, 4, 532–550. [Google Scholar] [CrossRef]
  40. Toussaint, G.T. Solving geometric problems with the rotating calipers. In Proceedings of IEEE MELECON, Athens, Greece, 1–8 May 1983; Volume 83, p. A10. [Google Scholar]
  41. Orlando, J.I.; Fu, H.; Breda, J.B.; Van Keer, K.; Bathula, D.R.; Diaz-Pinto, A.; Fang, R.; Heng, P.A.; Kim, J.; Lee, J.; et al. Refuge challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Med. Image Anal. 2020, 59, 101570. [Google Scholar] [CrossRef]
  42. Baid, U.; Ghodasara, S.; Mohan, S.; Bilello, M.; Calabrese, E.; Colak, E.; Farahani, K.; Kalpathy-Cramer, J.; Kitamura, F.C.; Pati, S.; et al. The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv 2021, arXiv:2107.02314. [Google Scholar]
  43. Heller, N.; Isensee, F.; Trofimova, D.; Tejpaul, R.; Zhao, Z.; Chen, H.; Wang, L.; Golts, A.; Khapun, D.; Shats, D.; et al. The KiTS21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase CT. arXiv 2023, arXiv:2307.01984. [Google Scholar]
  44. Yu, J.; Qin, J.; Xiang, J.; He, X.; Zhang, W.; Zhao, W. Trans-UNeter: A new Decoder of TransUNet for Medical Image Segmentation. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, Istanbul, Turkiye, 5–8 December 2023; pp. 2338–2341. [Google Scholar]
  45. Huang, S.; Liang, H.; Wang, Q.; Zhong, C.; Zhou, Z.; Shi, M. Seg-sam: Semantic-guided sam for unified medical image segmentation. arXiv 2024, arXiv:2412.12660. [Google Scholar]
  46. Choi, J.Y.; Yoo, T.K. Evaluating ChatGPT-4o for ophthalmic image interpretation: From in-context learning to code-free clinical tool generation. Inform. Health 2025, 2, 158–169. [Google Scholar] [CrossRef]
  47. Shi, L.; Li, X.; Hu, W.; Chen, H.; Chen, J.; Fan, Z.; Gao, M.; Jing, Y.; Lu, G.; Ma, D.; et al. EBHI-Seg: A novel enteroscope biopsy histopathological hematoxylin and eosin image dataset for image segmentation tasks. Front. Med. 2023, 10, 1114673. [Google Scholar] [CrossRef] [PubMed]
  48. Gamper, J.; Alemi Koohbanani, N.; Benet, K.; Khuram, A.; Rajpoot, N. Pannuke: An open pan-cancer histology dataset for nuclei instance segmentation and classification. In Proceedings of the European Congress on Digital Pathology, Warwick, UK, 10–13 April 2019; pp. 11–19. [Google Scholar]
  49. Da, Q.; Huang, X.; Li, Z.; Zuo, Y.; Zhang, C.; Liu, J.; Chen, W.; Li, J.; Xu, D.; Hu, Z.; et al. DigestPath: A benchmark dataset with challenge review for the pathological detection and segmentation of digestive-system. Med. Image Anal. 2022, 80, 102485. [Google Scholar] [CrossRef] [PubMed]
  50. Graham, S.; Chen, H.; Gamper, J.; Dou, Q.; Heng, P.A.; Snead, D.; Tsang, Y.W.; Rajpoot, N. MILD-Net: Minimal information loss dilated network for gland instance segmentation in colon histology images. Med. Image Anal. 2019, 52, 199–211. [Google Scholar] [CrossRef] [PubMed]
Figure 2. Overview of the proposed method.
Figure 3. Construction of the semantic guidance signal. In this pipeline, HyperMorph is an unsupervised medical image registration algorithm [37], while Spatial Transform refers to a non-trainable spatial transformation network [38].
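As a rough illustration of the spatial-transform step referenced in Figure 3, the sketch below warps a reference mask with a dense displacement field using PyTorch's grid_sample. It is a generic, non-trainable warp under assumed tensor shapes, not the authors' implementation; warp_mask and the flow layout are placeholder choices.

```python
import torch
import torch.nn.functional as F

def warp_mask(mask: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a reference mask with a dense displacement field (generic spatial transform).

    mask: (N, 1, H, W) float tensor with values in {0, 1}.
    flow: (N, 2, H, W) displacement in pixels; channel 0 = dx, channel 1 = dy.
    """
    n, _, h, w = mask.shape
    # Identity sampling grid in normalized [-1, 1] coordinates, shape (N, H, W, 2).
    theta = torch.eye(2, 3).unsqueeze(0).repeat(n, 1, 1)
    base_grid = F.affine_grid(theta, size=(n, 1, h, w), align_corners=True)
    # Convert pixel displacements to normalized coordinates and add them to the identity grid.
    norm_flow = torch.stack(
        (2.0 * flow[:, 0] / max(w - 1, 1), 2.0 * flow[:, 1] / max(h - 1, 1)), dim=-1
    )
    grid = base_grid + norm_flow
    # Nearest-neighbour sampling keeps the warped mask binary.
    return F.grid_sample(mask, grid, mode="nearest", align_corners=True)

# Toy usage: a zero displacement field leaves the mask unchanged.
mask = torch.zeros(1, 1, 8, 8)
mask[:, :, 2:5, 2:5] = 1.0
flow = torch.zeros(1, 2, 8, 8)
assert torch.allclose(warp_mask(mask, flow), mask)
```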
Figure 4. Illustration of the multi-granularity visual prompt generation strategy.
Figure 5. Qualitative comparisons with competitors on the REFUGE and KiTS23 datasets.
Figure 6. Failure case analysis of the HyperMorph registration.
Figure 7. DSC performance as a function of the number of in-context samples on the KiTS23 dataset.
Figure 8. Qualitative results on more medical modalities, including PET, X-ray, and 3D ultrasound.
Figure 9. Visualization of attention maps on the KiTS23 dataset.
Figure 10. Failure case analysis of our method on a histopathology dataset [47].
Table 1. Overview of the dataset characteristics of REFUGE [41], BraTS21 [42], and KiTS23 [43].

Category           | REFUGE           | BraTS21          | KiTS23
Segmentation task  | Glaucoma         | Brain tumor      | Renal tumor
Imaging modality   | 2D color fundus  | 3D MP-MRI        | 3D CE-CT
Target object      | Optic disc/cup   | Tumor subregion  | Kidney, tumor, cyst
Sample number      | 1200             | 4500             | 599
Table 2. Quantitative comparisons with state-of-the-art methods on universal medical image segmentation on REFUGE, BraTS21, and KiTS23 in terms of DSC. Higher values indicate better performance (a sketch of the DSC computation follows the table).

Dataset       | Method          | Number of In-Context Samples
              |                 | 1     | 2     | 4     | 8     | 16
REFUGE [41]   | SegGPT [35]     | 31.82 | 38.55 | 43.19 | 47.88 | 51.52
              | UniverSeg [3]   | 57.97 | 70.80 | 75.29 | 78.96 | 81.48
              | Neuralizer [10] | 67.74 | 71.09 | 73.14 | 73.25 | 75.31
              | ICL-SAM [36]    | 73.91 | 78.89 | 80.57 | 82.64 | 83.91
              | Ours            | 77.75 | 80.97 | 82.69 | 84.85 | 85.93
BraTS21 [42]  | SegGPT [35]     | 10.15 | 14.67 | 21.30 | 27.76 | 34.95
              | UniverSeg [3]   | 20.78 | 30.59 | 47.06 | 57.80 | 67.04
              | Neuralizer [10] | 21.61 | 22.84 | 25.40 | 25.62 | 32.02
              | ICL-SAM [36]    | 33.87 | 45.84 | 58.79 | 65.79 | 71.91
              | Ours            | 36.92 | 48.58 | 61.52 | 68.29 | 74.05
KiTS23 [43]   | SegGPT [35]     | 25.61 | 34.02 | 43.88 | 51.74 | 58.26
              | UniverSeg [3]   | 44.87 | 56.71 | 71.79 | 78.38 | 83.40
              | Neuralizer [10] | 37.15 | 52.78 | 55.10 | 63.84 | 65.15
              | ICL-SAM [36]    | 64.55 | 71.81 | 80.65 | 84.39 | 86.29
              | Ours            | 68.19 | 75.18 | 85.45 | 87.78 | 89.72
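For reference, DSC denotes the Dice Similarity Coefficient computed between a predicted mask and its ground truth. The snippet below is a minimal sketch of the standard binary-mask formulation, assuming NumPy arrays; the function name and toy data are illustrative and are not the authors' evaluation code.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Standard Dice Similarity Coefficient (DSC) for a pair of binary masks.

    pred, target: arrays of identical shape containing {0, 1} (or booleans).
    eps avoids division by zero when both masks are empty.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy example: 4 predicted foreground pixels, 6 ground-truth pixels, 4 shared.
pred = np.zeros((4, 4), dtype=np.uint8)
target = np.zeros((4, 4), dtype=np.uint8)
pred[1:3, 1:3] = 1      # 2x2 predicted region
target[1:4, 1:3] = 1    # 3x2 ground-truth region
print(f"DSC = {dice_coefficient(pred, target):.4f}")  # 2*4 / (4+6) = 0.8
```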
Table 3. Ablation studies on the REFUGE, BraTS21, and KiTS23 datasets.

Method                        | REFUGE | BraTS21 | KiTS23
w/o in-context selection      | 82.87  | 72.13   | 84.93
w/o semantic guidance         | 82.28  | 70.47   | 84.52
w/o multi-granularity prompt  | 83.81  | 72.21   | 86.37
w/o mask fusion               | 84.42  | 72.38   | 87.08
VG-SAM (Ours)                 | 85.93  | 74.05   | 89.72
Table 4. Effect of different SAM architectures on the REFUGE, BraTS21, and KiTS23 datasets.

Method                    | REFUGE | BraTS21 | KiTS23
Ours with SAM [13]        | 79.69  | 68.26   | 84.83
Ours with MedSAM [29]     | 83.17  | 70.65   | 83.39
Ours with SAM-Med2D [30]  | 84.26  | 72.14   | 87.37
Ours with SA-SAM [45]     | 85.93  | 74.05   | 89.72
Table 5. Ablation studies of the HyperMorph registration algorithm on the KiTS23 dataset.

Dataset      | Method          | Preference (%) | DSC
KiTS23 [43]  | w/o HyperMorph  | 21.95          | 86.58
KiTS23 [43]  | Ours (VG-SAM)   | 78.05          | 89.72
Table 6. Computational cost comparisons between our VG-SAM and competitors on the KiTS23 dataset.

Method         | Inference Time (s)               | Peak Memory (MB) | DSC
               | Retrieval | Segmentation | Total |                  |
UniverSeg [3]  | -         | 0.020        | 0.020 | 2997             | 83.40
ICL-SAM [36]   | -         | 0.167        | 0.167 | 4515             | 86.29
Ours (VG-SAM)  | 0.102     | 0.095        | 0.197 | 4115             | 89.72
Table 7. Segmentation cost analysis of our VG-SAM under different resolutions of query images.

Segmentation Cost  | Resolution of Query Images
                   | 240 × 240 | 512 × 512 | 1634 × 1634
Time (s)           | 0.071     | 0.095     | 0.406
GPU Memory (MB)    | 4105      | 4115      | 4133
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
