AGSAM: Agent-Guided Segment Anything Model for Automatic Segmentation in Few-Shot Scenarios

Precise medical image segmentation of regions of interest (ROIs) is crucial for accurate disease diagnosis and progression assessment. However, acquiring high-quality annotated data at the pixel level poses a significant challenge due to the resource-intensive nature of this process. This scarcity of high-quality annotated data results in few-shot scenarios, which are highly prevalent in clinical applications. To address this obstacle, this paper introduces Agent-Guided SAM (AGSAM), an innovative approach that transforms the Segment Anything Model (SAM) into a fully automated segmentation method by automating prompt generation. Capitalizing on the pre-trained feature extraction and decoding capabilities of SAM-Med2D, AGSAM circumvents the need for manual prompt engineering, ensuring adaptability across diverse segmentation methods. Furthermore, the proposed feature augmentation convolution module (FACM) enhances model accuracy by promoting stable feature representations. Experimental evaluations demonstrate AGSAM’s consistent superiority over other methods across various metrics. These findings highlight AGSAM’s efficacy in tackling the challenges associated with limited annotated data while achieving high-quality medical image segmentation.


Introduction
AI-based automated techniques have demonstrated their efficiency and success in medical image segmentation [1,2], which is indispensable for disease diagnosis and progression assessment. However, building high-quality AI segmentation models depends on abundant high-quality annotated samples [3]. Compared with other medical tasks such as classification [5,6], segmentation annotation is difficult and laborious in clinical applications, posing a challenge to effectively harnessing limited annotation [4].
This challenge of data scarcity can be addressed through few-shot learning approaches [7]. Few-shot learning enables pre-trained models to generalize and segment new categories of data that were unseen during the original training by leveraging just a few labeled samples per class.
The recent success of Generative Pre-Trained Transformer (GPT) models in few-shot tasks [8][9][10], pre-trained on vast datasets, has inspired the development of the Segment Anything Model (SAM), a model trained extensively on data to encode and decode features for segmentation, exhibiting remarkable few-shot and even zero-shot capabilities [11]. SAM-Med2D [12] bridges the gap between SAM's proficiency in natural images and its application in medical 2D image analysis. Studies on SAM and SAM-Med2D have revealed that simple prompts, such as a few point coordinates or rough region vertices, can effectively guide precise segmentation [12]. However, these methods still rely on additional manual prompts, limiting their applicability.
Recent work, such as AutoSAM [13] and nnSAM [4], has leveraged SAM's pre-trained image encoder for automated segmentation of medical images by integrating it into new architectures. However, these approaches focus only on SAM's encoding module while underutilizing its extensively trained decoding module. SAM's decoder holds strong value in few-shot scenarios, where limited training data make fine-tuning a new decoder inferior to utilizing SAM's already optimized decoder. Moreover, while SAM-Med2D fine-tunes SAM on medical images, it was reported that updating SAM's encoder and decoder modules, either individually or jointly, enhances segmentation performance, with joint updates achieving greater gains. This suggests that after extensive pre-training on abundant data, SAM's encoder and decoder inherently acquire valuable knowledge beneficial for segmentation tasks [12]. Hence, fully utilizing both the encoding and decoding capabilities within a trained SAM represents an unmet need for maximizing performance, especially with scarce training data.
In the few-shot scenario, data are exceedingly scarce, rendering them inadequate for directly fine-tuning the pre-trained SAM. Consequently, this paper describes training a lightweight segmentation model, known as the agent model, using these limited data. The predictions and acquired semantic features of this agent model are subsequently leveraged to generate prompts for the pre-trained model, thereby activating its encoding and decoding capabilities. We therefore propose a simplified agent-guided model to replace manual prompt generation, introducing an optimized framework called Agent-Guided SAM (AGSAM). AGSAM maximizes the utilization of SAM's trained encoder and decoder modules. The fusion of SAM and the guided model is particularly advantageous with limited training data, achieving high-quality medical image segmentation. The main contributions of this paper are summarized as follows:
a. Introducing a novel fully automatic segmentation approach, AGSAM. This method leverages SAM-Med2D's pre-trained feature extraction and generalized decoding capabilities. The guided model seamlessly replaces prompt generation and embedding, making it adaptable to any segmentation method.
b. Introducing a feature augmentation convolution module (FACM) in AGSAM, a parameter-free and computationally efficient module that enhances model accuracy. FACM ensures more stable feature representations, minimizing the impact of image noise on segmentation.
c. Experimental comparisons demonstrate that AGSAM consistently outperforms comparative methods across various metrics, showcasing its effectiveness in few-shot scenarios.

SAM and SAM-Med2D
The success of large language models like the GPT series in zero-shot and few-shot scenarios when trained on extensive datasets has translated to the vision domain through SAM [11]. Trained on a massive dataset of 11 million images and over a billion segmentation masks, SAM has emerged as a pivotal advancement for segmentation tasks, exhibiting significant potential for few-shot or zero-shot learning across diverse image categories. However, directly applying the pre-trained SAM to medical images reveals limitations in accuracy without additional domain-specific fine-tuning [12].
Thus, SAM-Med2D extended SAM's framework by introducing an adapter mechanism to acquire domain-specific knowledge from a large dataset of approximately 4.6 million medical images, acquired across diverse medical devices, and 19.7 million segmentation masks. This establishes a robust foundation tailored for medical image segmentation applications.

SAM-Based Method
Building upon SAM, a series of innovative methods have been developed to further enhance its capabilities.
AutoSAM goes beyond SAM's feature encoding by integrating an external lightweight task prediction head, aiming to leverage SAM's strong image encoding while extending applicability to a broader range of medical segmentation tasks for fully automated operation [13]. nnSAM serves as a plug-and-play solution that merges SAM's embedded features with encoded features from other segmentation models, executing predictions through the other model's decoding section [4]. This approach leverages SAM's encoding capabilities while integrating into established segmentation architectures. However, these existing SAM-based methods primarily focus on utilizing SAM's encoding module. As depicted in Figure 1, they essentially embed SAM's encoding into specific models rather than enhancing and optimizing SAM's complete workflow framework. Notably, none of these methods leverage the domain-specific pre-training of SAM-Med2D on massive medical image datasets. Inspired by SAM's strong performance, our objective is to unlock the potential of SAM's comprehensive set of modules, including its extensively trained decoder, for few-shot medical image segmentation by taking full advantage of SAM-Med2D's pre-training on medical data.

Architecture Overview

The existing methodologies leverage SAM's encoded features directly by integrating various segmentation prediction heads for automated segmentation, effectively utilizing SAM's encoding component to support other models (Figure 1). In contrast, our proposed approach integrates the pre-trained model's encoder and decoder modules with the agent model, which guides or assists SAM. The AGSAM framework introduces a strategy of integrating an additional segmentation model as a guiding agent within the SAM architecture (Figure 2). AGSAM's primary objective is to fully exploit SAM's pre-trained image-feature-encoding and prompt- and mask-decoding capabilities to enable autonomous segmentation without relying on manual prompts. By coupling SAM with the guiding segmentation model, AGSAM automates the prompt generation process typically required for SAM, accurately performing segmentation without human intervention or manual prompts.

Agent-Guided SAM
The AGSAM architecture consists of two primary modules: the agent model and the pre-trained SAM-Med2D (Figure 2). When an image is input into AGSAM, its features are extracted separately by the agent model and SAM's image feature encoder. This dual-path feature-embedding process enhances the diversity of features.
In this process, the input image I, after data pre-processing, is transformed into a tensor with dimensions of 3 × 256 × 256. First, for the feature pathway in SAM, the feature f_SAM is computed through the pre-trained encoder module M_SAM^encoder:

$f_{SAM} = M_{encoder}^{SAM}(I)$

Specifically, the image is tokenized by partitioning it into patches, followed by feature extraction and encoding through 12 layers of transformer modules within the ViT-B structure, ultimately resulting in a SAM feature f_SAM with dimensions of 256 × 16 × 16.
Additionally, the other pathway is determined by the selection of the agent model. Taking FCN as an example, for the same input image I, the agent encoder M_agent^encoder is the feature extraction module of FCN. It encodes the input to obtain the agent feature f_Agent with dimensions of 2048 × 32 × 32.
After obtaining the two features, they are fused by the fusion module M_CM as follows: f_SAM is aligned with f_Agent in terms of channels and spatial dimensions through a 3 × 3 convolutional layer Conv_3×3 and a bilinear interpolation operation F_bilinear, resulting in f'_SAM. Then, the two features are concatenated along the channel dimension. Finally, a 1 × 1 convolutional layer Conv_1×1 is applied to reduce the dimensionality of the concatenated feature channels. The entire process is represented as:

$f'_{SAM} = F_{bilinear}(Conv_{3\times3}(f_{SAM}))$

$f_{Fused} = Conv_{1\times1}(Concat(f'_{SAM}, f_{Agent}))$

After obtaining f_Fused, the agent model utilizes the agent decoder M_agent^decoder to decode this information and obtain the segmentation result Mask_agent. Taking FCN [14] as an example, this agent decoder is the mask prediction module of FCN.
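For concreteness, the following is a minimal PyTorch sketch of this fusion step. The module name FusionModule and any hyperparameters beyond those stated above are our own assumptions, not the authors' exact implementation; channel counts follow the FCN example (f_SAM is 256 × 16 × 16, f_Agent is 2048 × 32 × 32).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Sketch of the fusion module M_CM described above (assumed details)."""

    def __init__(self, sam_ch=256, agent_ch=2048):
        super().__init__()
        # 3x3 convolution aligning f_SAM's channels with f_Agent's
        self.align = nn.Conv2d(sam_ch, agent_ch, kernel_size=3, padding=1)
        # 1x1 convolution halving the concatenated channels (per Section 4.3.1)
        self.reduce = nn.Conv2d(2 * agent_ch, agent_ch, kernel_size=1)

    def forward(self, f_sam, f_agent):
        # Align channels, then match spatial size by bilinear interpolation
        f_sam_p = self.align(f_sam)
        f_sam_p = F.interpolate(f_sam_p, size=f_agent.shape[-2:],
                                mode="bilinear", align_corners=False)
        # Concatenate along the channel dimension and fuse
        return self.reduce(torch.cat([f_sam_p, f_agent], dim=1))

# Shape check with the dimensions quoted in the text
fused = FusionModule()(torch.randn(1, 256, 16, 16), torch.randn(1, 2048, 32, 32))
print(fused.shape)  # torch.Size([1, 2048, 32, 32])
```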
Then, the prompt generation module employs the agent's prediction and the fused feature to embed sparse and dense prompt features, respectively. There are two separate paths for generating the sparse prompt embedding and the dense prompt embedding (Figure S1). First, for the sparse prompt embedding p_sparse, the obtained agent prediction Mask_agent undergoes downsampling compression to 1/16 of its original size through a pooling layer Avgpool and three FACMs with a stride of 2 (Table S1). Subsequently, it goes through four FACMs for feature enhancement while maintaining the feature map size. Then, a single 1 × 1 convolutional layer Conv_1×1 is applied to adjust the channel dimensions of the feature map. The channel dimension is updated to n × 256, where n represents the number of segmentation categories. Finally, the tensor is rearranged to obtain the sparse prompt embedding p_sparse.
For the dense prompt embedding p_dense, the entire process is similar to that of the sparse prompt embedding, except that the input is the fused feature f_Fused instead of the mask Mask_agent. First, f_Fused is resized through interpolation to match the feature map size of Mask_agent, resulting in f'_Fused. Then, it undergoes a calculation process similar to the one above. Finally, after rearranging, the dense prompt embedding p_dense is obtained.
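The sparse pathway can be sketched as follows, approximating each FACM by its OFF-state convolution block (the FACM itself is described in a later section); the dense pathway mirrors this structure with f'_Fused as input. The intermediate channel widths and the final token rearrangement are assumptions, as the text does not fully specify them.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=1):
    # Stand-in for a FACM in its OFF state, i.e. a plain 3x3 convolution block
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.ReLU(inplace=True))

class SparsePromptGenerator(nn.Module):
    """Sketch of the sparse prompt pathway: AvgPool plus three stride-2 blocks
    (1/16 downsampling in total), four stride-1 blocks, then a 1x1 conv to
    n*256 channels (assumed details)."""

    def __init__(self, in_ch, n_classes, width=64):
        super().__init__()
        self.n = n_classes
        self.down = nn.Sequential(
            nn.AvgPool2d(2),                     # 256 -> 128
            conv_block(in_ch, width, stride=2),  # 128 -> 64
            conv_block(width, width, stride=2),  # 64 -> 32
            conv_block(width, width, stride=2),  # 32 -> 16
        )
        self.refine = nn.Sequential(*[conv_block(width, width) for _ in range(4)])
        self.to_tokens = nn.Conv2d(width, n_classes * 256, kernel_size=1)

    def forward(self, mask_agent):
        x = self.refine(self.down(mask_agent))
        x = self.to_tokens(x)                    # (B, n*256, 16, 16)
        b = x.shape[0]
        # Assumed rearrangement: average spatially to one 256-d token per class
        return x.view(b, self.n, 256, -1).mean(dim=-1)  # (B, n, 256)

p_sparse = SparsePromptGenerator(in_ch=4, n_classes=4)(torch.randn(1, 4, 256, 256))
print(p_sparse.shape)  # torch.Size([1, 4, 256])
```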
After obtaining the two prompt-embedding features, the pre-trained SAM's mask decoder module M_SAM^decoder is activated to generate the predicted result Mask_SAM. At the same time, the prediction of SAM is fused with the prediction of the agent to obtain the fused prediction result Mask_pred:

$Mask_{pred} = a \cdot Mask_{agent} + (1 - a) \cdot Mask_{SAM}$

where a ∈ [0.0, 1.0]. In this study, several empirical values were chosen, such as 0.1, 0.3, and 0.5. The overall loss L_all during the training of AGSAM is the sum of two parts: the loss L_Agent constraining the agent prediction Mask_agent against the annotation, and the loss L_pred constraining the final fused mask Mask_pred against the same annotation. It can be represented as:

$L_{all} = L_{Agent} + L_{pred}$
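A minimal sketch of this fusion and training objective, assuming a standard soft Dice loss and our reading that the weight a applies to the agent branch:

```python
import torch

def soft_dice_loss(pred, target, eps=1e-6):
    # Multi-class soft Dice loss; assumes pred and target are (B, C, H, W)
    # probability maps in [0, 1]
    inter = (pred * target).sum(dim=(-2, -1))
    union = pred.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def fuse_predictions(mask_agent, mask_sam, target, a=0.3):
    """Weighted fusion of the two predictions and the overall loss L_all."""
    mask_pred = a * mask_agent + (1.0 - a) * mask_sam        # Mask_pred
    l_all = soft_dice_loss(mask_agent, target) + soft_dice_loss(mask_pred, target)
    return mask_pred, l_all                                  # L_all = L_Agent + L_pred
```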

Feature Augmentation Convolution Module
To further enhance the model's generalization ability in few-shot scenarios, mitigate the risk of overfitting, and augment feature responses, the feature augmentation convolution module (FACM) was introduced. This module introduces a feature enhancement mechanism through random linear suppression. With the module status ON during training, the convolutional feature map is randomly linearly suppressed, expressed as:

$FACM_{ON}(f) = \alpha \cdot Conv_{3\times3}(f) + \beta$

Here, α follows a range of [0.25, 1.0], and β ranges from [−10, 10]. The FACM integrates random linear suppression to smooth and adjust feature information, selectively suppressing less discernible features. The OFF state of the FACM is equivalent to a single convolutional layer:

$FACM_{OFF}(f) = Conv_{3\times3}(f)$

During the training phase, the state of a FACM can be either ON or OFF, and it is always OFF during the inference phase. This mechanism guides the model toward more contrasted feature response values, resulting in an augmented feature set that enriches the overall information content. This allows for feature response enhancement without adding extra parameters or increasing computational complexity.
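A minimal PyTorch sketch of a FACM under these assumptions; the activation probabilities of the suppression and zeroing branches are not specified in the text and are placeholders here.

```python
import torch
import torch.nn as nn

class FACM(nn.Module):
    """Sketch of the feature augmentation convolution module. With status ON
    during training, the conv output is randomly linearly suppressed with
    alpha ~ U(0.25, 1.0) and beta ~ U(-10, 10), and with a small probability
    the responses are zeroed (per Section 4.3.1). OFF (and at inference) it
    reduces to a plain 3x3 conv block. p_suppress and p_zero are assumptions."""

    def __init__(self, cin, cout, stride=1, status_on=True,
                 p_suppress=0.5, p_zero=0.05):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, stride=stride, padding=1)
        self.status_on = status_on
        self.p_suppress, self.p_zero = p_suppress, p_zero

    def forward(self, x):
        f = self.conv(x)
        if self.training and self.status_on:
            r = torch.rand(1).item()
            if r < self.p_zero:
                f = torch.zeros_like(f)                       # rare hard suppression
            elif r < self.p_suppress:
                alpha = 0.25 + 0.75 * torch.rand(1, device=f.device)
                beta = -10 + 20 * torch.rand(1, device=f.device)
                f = alpha * f + beta                          # random linear suppression
        return f
```

Note that the randomness adds no learnable parameters beyond the convolution itself, consistent with the parameter-free claim above.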

Experiments

Datasets
The CAMUS [15] dataset comprises cardiac ultrasound images from 500 patients in apical two-chamber (A2C) and apical four-chamber (A4C) views. Collected from a single vendor (GE Vingmed Ultrasound, Horten, Norway) and center (Saint-Etienne University Hospital, France), it is a highly heterogeneous dataset encompassing varying image quality and pathological cases, representing typical real-world clinical practice data. For each patient, CAMUS provides complete cardiac cycles with manual annotations delineating the end-diastolic (ED) and end-systolic (ES) cardiac structures for each view. As annotations were unavailable for the last 50 patients in the training data, the remaining 450 patients were utilized for training and testing purposes. With four different orientations imaged per patient, the dataset contains a total of 1800 annotated images. The dataset was split into 512 samples for the training set and 88 samples for the validation set, with the remaining 1200 samples used for testing. The annotations include the endocardium and epicardium of the left ventricle, as well as the left atrium wall, serving as the ground truth for both supervised training and evaluation. Accurate segmentation of these regions is crucial for subsequent clinical measurements such as ventricular volume and ejection fraction.
The REFUGE [16] dataset consists of 1200 retinal color fundus photographs (CFPs) acquired from patients in an upright sitting position using one of two devices: a Zeiss Visucam 500 or a Canon CR-2. Manual annotations of the optic disc (OD) and optic cup (OC) boundaries were provided by seven independent glaucoma specialists, serving as the ground truth for both supervised training and evaluation. The first 100 samples (50 for training and 50 for validation) were utilized for training and validation to select the optimal model weights. Subsequently, the remaining 1100 samples were used for testing purposes. Accurate segmentation of the OD and OC regions enables assessment of a patient's risk of glaucoma based on the size ratio between these structures.

Data Pre-Processing
For the CAMUS dataset, data pre-processing includes converting grayscale images to 3-channel RGB images and transforming pixel values to integer values ranging from 0 to 255. During training, both images and annotated masks undergo random central rotations of ±10°. Finally, the images are uniformly resized to 256 × 256 pixels for training.
For the REFUGE dataset, as retinal fundus photographs are inherently RGB three-channel images, there is no need to convert single-channel data to multi-channel data. The remaining pre-processing steps are consistent with those used for the CAMUS dataset.
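As an illustration, the image side of this pipeline could be sketched with torchvision as follows; the interpolation settings are assumptions, and in practice the random rotation must be applied jointly to the image and its annotation mask (with nearest-neighbor interpolation for the mask), which this stand-alone Compose does not handle by itself.

```python
from torchvision import transforms

# Minimal sketch of the pre-processing described above (assumed details)
train_transform = transforms.Compose([
    transforms.Lambda(lambda im: im.convert("RGB")),  # grayscale -> 3-channel RGB (uint8, 0-255)
    transforms.RandomRotation(degrees=10),            # random central rotation of ±10°
    transforms.Resize((256, 256)),                    # uniform 256 x 256 input size
    transforms.ToTensor(),
])
```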

Implementation Details and Evaluation Metrics

Architecture Description
Our AGSAM builds upon the SAM-Med2D method, which employs SAM's ViT-B structure consisting of 12 transformer layers, each with 12 attention heads. The encoding dimension in each layer is set to 768, followed by two convolutional layers to reduce the feature embedding to 256, producing the SAM feature. On the other side, the guiding model is a basic segmentation model with an encoder-decoder structure, taking FCN as an example. Its encoder is the standard ResNet-50 structure before global pooling, resulting in an agent feature with dimensions of 2048 × h/8 × w/8, where h and w are the height and width of the input image, respectively.
In the fusion module, the SAM feature is aligned with the agent feature through a single convolutional layer, followed by interpolation to match the size of the agent's feature and concatenation along channels. A 1 × 1 convolutional layer is then applied for channel fusion, compressing the channel count by half. The fused feature undergoes decoding and mask prediction in the guiding model's decoder. Using FCN as an example, the decoder consists of two convolutional operations to adjust the channel dimensions to the number of segmentation classes, followed by a linear interpolation upsampling operation to obtain the predicted mask of the agent model.
The predicted mask and the fused feature are input into the prompt generation module. This module contains two channels, each processing one of the two feature types. The two pathways share a similar structure, involving average pooling for feature downsampling. Seven layers of FACMs are utilized for further feature extraction, with a final 1 × 1 convolution reshaping the channels into 768. Notably, within the seven FACM blocks, only the first layer activates the augmentation block (ON), while the remaining layers do not (OFF); the convolution blocks in the first three FACMs have a stride of 2, while the subsequent four layers have a stride of 1 (Table S1). The output represents the feature embeddings for the two types of prompts, corresponding to sparse prompts generated from the predicted mask and dense prompts generated from the fused feature.
The FACM within the prompt generation module consists of a convolutional block and an augmentation block employing random linear transformations. During training, when the status is set to ON, the augmentation block is activated with a certain probability, suppressing the response values of the convolutional feature map or, with a small probability, setting the feature response values to zero. During inference, or when its status is set to OFF, it behaves like a standard 3 × 3 convolutional block.
Finally, the fused image feature embedding, along with the generated feature embeddings for the two types of prompts, is fed into the mask decoder of SAM. This module directly adopts SAM's mask-decoding module, essentially combining the image feature embedding and the various prompt feature embeddings and producing the predicted mask results. The ultimate mask prediction result is a weighted combination of this mask result and the guiding model's predicted mask.
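As an illustration, the decoding step might look as follows, assuming the mask decoder interface of the original segment-anything repository (SAM-Med2D's fork may differ); sam, f_fused_256, p_sparse, and p_dense are placeholders, and the projection of the fused feature to SAM's expected 256-channel embedding is assumed.

```python
# Hypothetical glue code driving SAM's mask decoder with the generated
# embeddings; `sam` is a loaded Sam model and `f_fused_256` is the fused
# image feature already projected to SAM's embedding shape.
low_res_masks, iou_pred = sam.mask_decoder(
    image_embeddings=f_fused_256,                # (B, 256, 16, 16)
    image_pe=sam.prompt_encoder.get_dense_pe(),  # dense positional encoding
    sparse_prompt_embeddings=p_sparse,           # (B, n, 256), from Mask_agent
    dense_prompt_embeddings=p_dense,             # (B, 256, 16, 16), from f_Fused
    multimask_output=False,
)
```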

Setting of Training
All methods were implemented using PyTorch, and both training and testing were conducted on a single NVIDIA GPU (RTX 4090 with 24 GB memory). The training process for all methods involved initially augmenting the data to 512 copies, followed by training for 50 epochs. The batch size was set to 8, except for nnSAM (4) and Mamba-Unet (1). Dice loss was selected as the loss function. The optimizer was AdamW, with an initial learning rate of 1 × 10⁻⁴ (1 × 10⁻⁵ for SegFormer). Starting from the 26th epoch, the learning rate decayed to 1 × 10⁻⁵ (1 × 10⁻⁶ for SegFormer). The best model for evaluation was selected by performance on the validation set.
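A minimal sketch of this training schedule; model, train_loader, and dice_loss are placeholders for the reader's own components.

```python
import torch

def train(model, train_loader, dice_loss, epochs=50):
    """AdamW at 1e-4, decayed to 1e-5 from the 26th epoch; Dice loss."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for epoch in range(epochs):
        if epoch == 25:  # the 26th epoch, counting from 1
            for group in optimizer.param_groups:
                group["lr"] = 1e-5
        for images, masks in train_loader:  # batch size 8 in the paper
            optimizer.zero_grad()
            loss = dice_loss(model(images), masks)
            loss.backward()
            optimizer.step()
```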

Evaluation Metrics
For evaluation, we employed the dice similarity coefficient (DICE) and Hausdorff distance (HD) as metrics. The DICE assesses the similarity of segmentation results, representing internal overlap [17], while the HD focuses on the similarity of the contour portions of the segmentation results [18]. These metrics systematically evaluate segmentation effectiveness, considering both internal and contour regions. Additionally, we report further metrics such as sensitivity, specificity, AUC, and AUPR in the Supplementary Materials.
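A minimal sketch of the two metrics on binary masks; production implementations typically extract contour points first and may report a percentile variant of the HD.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(pred, gt, eps=1e-8):
    # DICE: internal overlap between two binary masks
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

def hausdorff_distance(pred, gt):
    # Symmetric Hausdorff distance; here approximated over the
    # foreground pixel coordinates rather than extracted contours
    p = np.argwhere(pred)
    g = np.argwhere(gt)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
```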

Experimental Settings
To assess AGSAM's performance, we conducted various experiments for comparisons and an ablation study. To ensure fairness in the comparisons, all methods considered in the study were systematically tested by reproducing their core functionalities and utilizing open-source code. Training or fine-tuning was uniformly performed on the research dataset. More details about the hyperparameters of the experiments and training can be found in Tables S2 and S3. A p value < 0.05 was considered significant, and p values were two-sided (t-test).
For the baseline segmentation models, we utilized publicly available and widely used source code. However, in the case of nnSAM, due to our hardware constraints and the unique configuration of its core network, nnUnet, which differs from the other methods evaluated in our study, we followed the description provided in its paper to reproduce the nnSAM method. During this process, we replaced the core network with a lighter network solution and substituted MobileSAM [27] with SAM-Med2D.

Ablation Study
We conducted ablation experiments to validate the effectiveness of the integrated modules with one training sample. We used FCN, DeepLabV3, and Unet++ as three segmentation baselines to evaluate the performance of different combinations of the image feature encoder of SAM (FE), the mask decoder of SAM (MD), and the feature augmentation convolution module (FACM). To verify the effectiveness of AGSAM's strategies, we performed ablations on the structure of the agent model, comparing the impact of different agent model structures between nnSAM and our proposed method. Specifically, based on the proposed method and nnSAM, we used FCN (nnSAM (FCN) vs. proposed (FCN)) and DeepLabV3 (nnSAM (Deep) vs. proposed (Deep)) to compare the two methods. All results were evaluated using the DICE and Hausdorff distance (HD) metrics to assess differences in the impact on segmentation region interiors and edge regions. Additionally, we included a series of supplementary evaluations to validate the methods proposed in the paper. These evaluations encompassed the computational efficiency of the different methods, assessing the influence of additional data augmentation on the results, and comparing the performance of the automatic prompting scheme proposed in this paper with the manually provided coordinate prompts of the SAM-Med2D method.

Comparison in Few-Shot Scenario with CAMUS
We evaluated each method's performance across varying amounts of training data. As expected, the evaluation metrics showed consistent improvement for all methods as the data volume increased (Tables 1, S4 and S5). Notably, AGSAM consistently outperformed the other approaches, underscoring the effectiveness of fully leveraging the pre-trained SAM. However, this performance differential gradually narrowed with increasing training data volume. This can be attributed to the other approaches having more data available to better fine-tune their models when provided with larger datasets. AGSAM's advantage was most pronounced in limited data scenarios, where its ability to exploit the pre-trained SAM provided significant benefits over methods that rely heavily on data-driven fine-tuning from scratch.
As the data volume increased toward a more typical few-shot setting, methods adopting pre-trained weight initialization (ResNet-50 pre-trained on the ImageNet dataset), such as FCN, DeepLabV3, and PSPNet, demonstrated significant superiority over random initialization, as evidenced by the performance comparison with Unet++ (Figures 3 and S2). Comparing DeepLabV3 with AGSAM, noteworthy improvements by AGSAM were observed in all categories except for the Endocardium. For example, the Epicardium result was enhanced from 0.4248 to 0.5372, and that of the Left Atrium Wall from 0.4112 to 0.5073, while the Endocardium metric experienced a slight decrease from 0.7112 to 0.6829 (Table S6). This can be attributed to the Endocardium's prevalent central positioning and near-uniform elliptical shape within the images, allowing methods like DeepLabV3 to more effectively learn relevant features even with extremely limited data. As data volume increased, AGSAM outperformed the various baselines in achieving more accurate segmentations for all three structures (Figure S3). Ultimately, when trained with 512 samples, despite the metrics being close across methods, AGSAM still yielded the best overall results (Tables S7 and S8).
Visual examination of the predicted segmentation results revealed distinctive characteristics (Figure 3). AGSAM exhibited fewer false positive segmentation pixels, especially in limited training data scenarios. This was particularly noticeable for the Left Atrium Wall segmentations, which were poor across all methods until more training data were provided. However, even with increased data, SegFormer continued to struggle with reliable predictions for this category. These observations highlight AGSAM's efficient data utilization in few-shot scenarios and its more generalized feature extraction and semantic induction capabilities.

Comparison in Few-Shot Scenario with REFUGE
Multiple experiments were conducted on the REFUGE dataset using varying data volumes, following the comparative studies for the CAMUS dataset. AGSAM consistently outperformed the other solutions in most cases across different evaluation metrics. However, as the dataset volume increased, the performance advantage of the proposed method gradually diminished, aligning with the trends observed in the previous experiments (Table 2). Additional indicators and detailed category-wise results are provided in the Supplementary Materials, with AGSAM performing the best for both the optic cup and the optic disc (Tables S9-S11 and Figures S4 and S5).
Visual inspection of the predicted segmentation results revealed that even with very limited data, AGSAM exhibited higher sensitivity in detecting the optic cup area within the images (Figure 4). As more data became available, all methods demonstrated noticeable segmentation improvements. However, methods like nnSAM, Mamba-Unet, TGANet, and SegFormer exhibited significant false positives at small data volumes. In contrast, the results of AGSAM were very stable, with the predicted areas gradually aligning more closely with the annotated ground truth.

Ablation Studies of AGSAM
The ablation experiments were conducted with a single training sample from the CAMUS dataset. First, the baseline model FCN was tested as the agent model. Successively, the SAM feature encoder (FE), which corresponds to nnSAM (FCN), SAM's mask decoder (MD), and the feature augmentation convolution module (FACM) were incrementally added. Metrics revealed improvements with the addition of each module over the baseline, and the combination of all modules resulted in the maximum enhancement, from 0.4819 to 0.5419 (Table 3). Furthermore, we assessed FCN, DeepLabV3, and Unet++ as guided models for AGSAM. Consistent trends observed across methods (e.g., from 0.5157 to 0.5758 for DeepLabV3; from 0.2486 to 0.2615 for Unet++) (Table 3) suggest the effectiveness of AGSAM's modules, rather than reflecting inherent randomness associated with a single model. To assess the efficacy of FACM feature augmentation, we visualized the extracted feature maps [5,6] (Figure 5) using Grad-CAM [28]. This visualization revealed that after FACM processing, the feature maps exhibited improved alignment and brightness with respect to the annotated regions, indicating a better correspondence between the augmented features and the ground truth annotations. This observation suggests that suppressing irrelevant feature responses during training enhances the model's ability to produce relevant responses.
The design intention behind the FACM was to ensure that during supervised training, the feature maps' response values and positions at each layer in the current iteration contain semantic feature information contributing to the final segmentation result. By randomly compressing the values of the feature maps, the model is encouraged to prioritize larger response values in subsequent iterations to counteract the effects of random compression. Additionally, the FACM aims to minimize the inclusion of feature response regions irrelevant to the final prediction.
The visualization results (Figure 5) demonstrate that the integration of the FACM effectively suppresses irrelevant feature information while enhancing feature responses relevant to the prediction results. This enhancement allows the prompt feature embedding to extract relatively reliable information from noisy or rough data, contributing to improved segmentation performance.
Although the guided model may produce unreliable predictions, leading to unreliable prompt information, the impact of these inaccuracies can be mitigated in two ways: (1) the subsequent prompt-embedding module can accommodate a certain amount of noise, SAM's decoder is immune to a certain extent of erroneous cues, and the final decoding results correct some of the prediction errors made by the agent used to generate the prompt embeddings; (2) the integration of the FACM can suppress some irrelevant feature information and enhance feature responses related to the prediction results, as shown in the figure, enabling the prompt feature embedding to extract relatively reliable information from rough data.
AGSAM predictions are a weighted sum of the agent's and SAM decoder's outputs (Table 4). Varying the weights showed that higher decoder weights are crucial with limited data, relying on SAM's generalized decoding. As data increase, the agent gains more task experience through training; consequently, the agent's prediction weights can be gradually strengthened. We believe specialized experience will surpass generalized experience given sufficient data. Fine-tuning the decoder could then further improve accuracy. In terms of computational efficiency, although the proposed method has a larger computational load due to the presence of SAM, it can still process predictions for 35 samples per second, meeting the requirements for real-time processing (Table S12). Evaluation of the proposed method using FACMs in the two different states (ON/OFF) showed no increase in processing time, indicating that FACMs do not add computational overhead.
For the data augmentation experiment, random displacement and random scaling were added to the original settings (Table S13). Comparing these results with the previous settings, data augmentation was found to increase the diversity of the training data through simulated samples, thereby alleviating the scarcity of diversity in few-shot scenarios (Tables S14-S17).
Compared with SAM-Med2D augmented with manual prompts, AGSAM achieved better results in a fully automatic manner (Tables S18 and S19 and Figure S6). AGSAM may thus provide a solution for enhancing the performance of SAM-Med2D in specific medical segmentation applications.

Limitations of AGSAM
Our current study has some limitations that should be addressed in future work. First, the current research has only been validated on two modalities of data: ultrasound and fundus color photography. Subsequent studies will require more extensive validation experiments on various other modalities. Given that SAM and SAM-Med2D were, in principle, trained on multiple data modalities, validating SAM's feature-encoding and -decoding capabilities on additional datasets is a natural avenue for further exploration.
Another crucial point is the scope of the validation scenarios, which, thus far, have focused on few-shot, particularly one-shot, scenarios. The challenges associated with handling entirely new data without any labeled samples are recognized, paving the way for future investigations into prompt learning strategies akin to those of large language models (LLMs) for achieving zero-shot tasks.
Finally, this paper explores several approaches for model enhancement and incorporates knowledge from the general-purpose SAM. However, what is still lacking is more comprehensive utilization of data, such as clinically relevant prior knowledge of the segmentation targets and addressing the diversity of the data distribution. In the future, we can focus on integrating more prior knowledge to ensure segmentation accuracy and on optimizing through combinations of data augmentation, synthetic data generation, and various generative models.

Conclusions
This paper introduces AGSAM, a framework leveraging SAM for automatic segmentation with limited training samples. Through an agent-guided model, SAM's feature encoding is extracted and fused to generate prompt embeddings for SAM's mask decoder, eliminating the need for manual inputs. The image-feature-encoding module and its fusion module enhance SAM's encoding by integrating the guiding model's capabilities, inheriting SAM's universal feature representation. The prompt generation module automatically guides the mask-decoding predictions, generating both sparse and dense prompts to comprehensively constrain the decoder while avoiding redundancy. The mask decoder, based on SAM, combines decoding experience with the prompt embeddings for reliable predictions, which are integrated with the guiding model's own predictions. The superior metrics are attributed to these modules effectively extracting and transferring SAM's pre-trained 'segmentation experience' to AGSAM. Additionally, an online feature enhancement training strategy was explored, suppressing the model's feature responses during training to force improved responses to effective features in the next iteration, boosting encoding ability. Together, these components ensure that AGSAM benefits across varying training data quantities. The proposed method has potential as a new few-shot benchmark leveraging SAM-like pre-training.

Figure 3. Comparison of predictions of different methods with different training samples (n) in the CAMUS dataset.

Figure 4. Comparison of predictions of different methods with different training samples (n) in the REFUGE dataset.

Table 3. Results of different AGSAM-based models with different module combinations in the ablation analysis.

Table 4. Results of AGSAM with different fusion weight ratios of agent and SAM.