1. Introduction
Strawberry is one of the most economically valuable and widely consumed fruits in the United States, with an annual production of 1.33 million tons in 2021, valued at USD 3.42 billion [
1]. Meeting this demand is increasingly challenging because of environmental constraints and a shortage of labor. Hence, precision agriculture, particularly the advancement of artificial intelligence (AI), has emerged as a vital strategy to address these challenges, offering advanced technological solutions that improve fruit management, optimize yields, and promote sustainable farming practices [
2]. Computer vision (CV) plays an important role in this paradigm. By allowing the accurate detection and segmentation of fruits, such as strawberries, CV facilitates key agricultural applications, including automated harvesting, yield estimation, and plant health monitoring [
3], and thus helps growers significantly reduce labor costs.
In recent decades, many studies have highlighted the relevance of the application of the CV approach for strawberry fruit detection and segmentation. For example, Bashir et al. [
4] developed a YOLOv8-based CV system to detect the four stages of strawberry growth from flowering to ripening and count them in real time. He et al. [
5] developed YOLOv5s-Straw, a customized YOLOv5-based detector, which achieved an mAP of 80.3% across three maturity stages, although it lacked fine-grained segmentation capabilities. Recently, Crespo et al. [
6] proposed an efficient Mask R-CNN pipeline optimized with TensorRT to achieve both high segmentation accuracy and real-time processing speed. Moreover, StrawSnake, introduced by Guo et al. [
7] based on a contour learning approach, produced an mIoU of 81.54%, emphasizing its ability to segment complex boundaries, albeit under supervised learning constraints. Despite their advantages, conventional CV approaches in agriculture are hindered by their heavy reliance on manual annotation. Segmentation requires domain experts to delineate object boundaries at the pixel level. This process is not only time-consuming but also labor-intensive, as it requires technical expertise and effort to ensure accuracy [
8,
9]. This challenge highlights the necessity for methods that can reduce the reliance on manual labeling while ensuring high performance.
Zero-shot learning (ZSL) has emerged as a promising solution for reducing data dependency in agricultural CV tasks [
10]. ZSL enables models to generalize to unseen categories without explicit training on the target dataset by leveraging semantic relationships, textual descriptions, and attribute embeddings [
11,
12,
13]. Foundation models, particularly the Segment Anything Model 2 (SAM2) [
14], have shown excellent performance for segmentation in ZSL settings and are capable of generating pixel-wise boundaries of objects across diverse image domains. However, SAM2 relies heavily on manual expert prompts to achieve class-specific segmentation [
14]. This dependence on human input restricts its automation and scalability, particularly when dealing with large-scale agricultural datasets. To mitigate this limitation, a self-prompting strategy can be employed, in which prompts are generated automatically to guide SAM2 toward object-specific segmentation rather than generic object delineation. An object detection model, such as YOLO, can supply these prompts so that SAM2 segments only the target objects rather than every object in the image.
Recent studies have demonstrated the effectiveness of YOLO-SAM integration for agricultural applications. For instance, Huang et al. [
15] proposed a framework for strawberry canopy segmentation that uses YOLOv8 to guide SAM in producing precise segmentation masks, obtaining a mean intersection over union (mIoU) of 0.749. A plant leaf recognition system following a similar strategy (YOLOv8 + SAM) was introduced by Zhao et al. [
16] for three plant species (lilac, field cotton, and mulberry-leaf peony) and attained a detection accuracy of 87% with an inference time of 0.03 s per image. Reddy et al. [
17] applied YOLOv7 and YOLOv8 with SAM to images captured by unmanned aerial vehicles (UAVs), and their experiments reported a score of 0.913 for cotton yield prediction. Moreover, a similar approach was adopted by [
18] for mango orchard canopy delineation, applying YOLOv7 as the detection model with SAM. These studies consistently reported improved segmentation accuracy, annotation efficiency, and yield estimation reliability. Furthermore, methods such as YOLO–DINO–SAM in panoptic segmentation tasks (PhenoBench dataset) for weeds achieved high PQ+ scores (≈81) [
19], demonstrating the potential of combining detection and foundation models for hierarchical agricultural scene understanding. Although the above-mentioned studies underscore the efficiency of different YOLO versions for object detection, the standard YOLO framework is not inherently zero-shot or few-shot and requires annotated datasets for training. To extend the YOLO capabilities toward ZSL, variants such as YOLO-World were introduced, which incorporate text embeddings and open-vocabulary detection to recognize unseen classes [
20]. However, within our agricultural context, ZSL-based YOLO models exhibit limited performance because of the substantial visual complexity, pervasive occlusions, and the fine-grained distinctions among plant organs, especially in strawberry plants. Few-shot learning (FSL) can provide a practical alternative to address these challenges. FSL fine-tunes pretrained models using a limited number of labeled samples, typically ranging from a handful to a few hundred [
21,
22]. By leveraging FSL, the dependence on extensive manual annotations can be substantially reduced, thereby facilitating prompt generation and adaptation in ZSL scenarios.
To address this gap, we propose YOLO-SAM AgriScan, combining ZSL and FSL paradigms together, integrating YOLOv11-based few-shot detection with SAM2-based zero-shot segmentation. YOLOv11 was fine-tuned under the FSL paradigm to detect ripe strawberries by providing bounding-box (Bbox) prompts, which were subsequently passed to SAM2 for high-resolution segmentation. This unified design effectively merges the strengths of both paradigms, leveraging FSL’s adaptability to limited data and ZSL’s generalization ability, to achieve precise, scalable, and annotation-efficient strawberry segmentation.
The key contributions of this study are as follows:
- 1.
An FSL framework for ripe strawberry detection using YOLOv11 is established, trained with varying image counts (50, 100, and 200) and epoch numbers (5, 10, and 20) to evaluate performance under limited data conditions.
- 2.
A self-prompting mechanism is introduced to integrate the trained YOLOv11 with SAM2, enabling ZSL for segmentation. This approach eliminates the need for additional pixel-level annotations and postprocessing by guiding SAM2 to perform object-specific segmentation based on YOLO-detected Bboxes.
The remainder of this paper is organized as follows:
Section 2 details the methodology, covering data acquisition, preprocessing, and model development. The detailed experimental setup is described in
Section 2.5.
Section 3 presents the experimental results of the proposed method. Finally,
Section 4 concludes the paper and outlines directions for future work.
2. Methodology
2.1. Data Acquisition and Processing
To ensure wide applicability, the proposed model was trained and validated on two distinct datasets: our own greenhouse-grown strawberry dataset and a publicly available field-grown strawberry dataset.
Greenhouse-grown strawberry data: This dataset consists of 300 images of greenhouse-grown strawberry plants, each with a resolution of 1280 × 720 pixels. Greenhouse-grown strawberries typically hang from the plant without contacting the soil, as illustrated in
Figure 1a. We potted 114 strawberry plants in standard 1-gallon (3.78 L) nursery pots and cultivated them on tabletops in a greenhouse at the Texas A&M AgriLife Research and Extension Center in Dallas, TX. An Intel® RealSense™ D435i camera (Intel Corporation, Santa Clara, CA, USA) was used to collect data, targeting on-plant ripe strawberries. Images captured under low-light conditions were enhanced by increasing the brightness by 20% and applying gamma correction to improve the visual quality.
Field-grown strawberry data: We also adopted a field-grown strawberry dataset collected from the Roboflow Universe [
23], where strawberry plants and fruits often lie directly on the soil bed, as shown in
Figure 1b. This dataset contains 900 images with a resolution of 3024 × 4032 pixels and, like the greenhouse dataset, targets ripe strawberries.
The greenhouse dataset was annotated using CVAT (CVAT.ai Corporation, Wilmington, DE, USA) [
24]. A Bbox was drawn for each ripe strawberry and labeled as ‘ripe’. In addition, we outlined the polygon of each corresponding strawberry to evaluate the segmentation performance of the proposed zero-shot segmentation model. For the field-grown dataset, we utilized the existing annotation files, in which ripe strawberries were already labeled as ‘strawberry’.
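For readers reproducing the detection stage, a minimal sketch of a single-class, YOLO-format dataset configuration is shown below; the directory layout and file names are illustrative assumptions rather than the exact paths used in this study, and the field-grown data would use its existing ‘strawberry’ label instead.

```python
# Minimal, illustrative YOLO dataset config for the single 'ripe' class.
# Paths and file names are assumptions for demonstration only.
import yaml

data_cfg = {
    "path": "datasets/strawberry",   # dataset root (hypothetical)
    "train": "images/train",         # training images
    "val": "images/val",             # validation/test images
    "names": {0: "ripe"},            # single class: ripe strawberry
}

with open("strawberry.yaml", "w") as f:
    yaml.safe_dump(data_cfg, f, sort_keys=False)
```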
2.2. Fine-Tuning YOLOv11 for Detection
YOLOv11 was fine-tuned with only a few labeled samples to obtain the prompt Bboxes of ripe strawberries. The architecture of YOLOv11 integrates three key components: the C3K2 block, SPPF module, and C2PSA block [4]. The C3K2 block applies lightweight convolutions to reduce the number of parameters while retaining high-quality feature extraction. For an image $I_i$ with $n$ candidate regions, the detection process can be formulated as
$$\{(b_j, c_j, s_j)\}_{j=1}^{n} = f_{\text{det}}(I_i; \theta), \quad (1)$$
where $b_j$ denotes the Bbox parameters, $c_j$ the predicted class label, and $s_j$ the confidence score for instance $j$.
Transfer learning was also carried out for the pretrained model, as illustrated in
Figure 2. The backbone layers were frozen to preserve previously learned low-level features, such as edges, textures, and shapes, whereas the neck and detection head were fine-tuned using strawberry images.
This approach reduces the training cost, stabilizes convergence, and prevents overfitting on a limited dataset. Formally, the model parameters are split as $\theta = \{\theta_b, \theta_h\}$, where $\theta_b$ denotes the frozen backbone weights and $\theta_h$ the trainable neck and head weights; the training updates are applied only to $\theta_h$:
$$\theta_h \leftarrow \theta_h - \eta\, \nabla_{\theta_h} \mathcal{L}_{\text{det}}, \quad (2)$$
where $\eta$ is the learning rate and $\mathcal{L}_{\text{det}}$ is the detection loss that combines the classification, localization, and objectness terms. The gradient $\nabla_{\theta_h} \mathcal{L}_{\text{det}}$ indicates how the loss changes with respect to the trainable parameters, guiding the optimization process during fine-tuning. This ensures that YOLOv11 leverages pretrained general features while adapting effectively to strawberry-specific detection.
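To make the parameter split in Equation (2) concrete, the following PyTorch-style sketch freezes a stand-in backbone and updates only the neck and head parameters with AdamW; the modules and hyperparameter values are placeholders, not the actual YOLOv11 implementation or the settings used in this study.

```python
# Conceptual sketch of Equation (2): freeze backbone weights (theta_b) and
# update only the neck/head weights (theta_h) with AdamW.
# The 'backbone', 'neck', and 'head' modules are stand-ins, not real YOLOv11 layers.
import torch
import torch.nn as nn

model = nn.ModuleDict({
    "backbone": nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
    "neck": nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU()),
    "head": nn.Conv2d(16, 6, 1),  # e.g., box (4) + objectness + class
})

# theta_b: frozen backbone parameters
for p in model["backbone"].parameters():
    p.requires_grad = False

# theta_h: trainable neck + head parameters
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3, weight_decay=5e-4)  # illustrative values

x = torch.randn(2, 3, 64, 64)                  # dummy image batch
features = model["backbone"](x)                # frozen feature extraction
pred = model["head"](model["neck"](features))  # task-specific layers
loss = pred.pow(2).mean()                      # placeholder for L_det

optimizer.zero_grad()
loss.backward()   # gradients flow only into theta_h
optimizer.step()  # theta_h <- theta_h - eta * grad
```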
2.3. SAM2 for Segmentation
SAM2 is a segmentation framework developed by Meta. It introduces a unified architecture that combines image and video segmentation capabilities, thereby enabling seamless performance across diverse visual data. In this work, only the image segmentation pathway of SAM2 is used, as our dataset consists entirely of single images. The model leverages large-scale pretraining over 600,000 annotations, allowing it to generalize its zero-shot segmentation capabilities. It can generate segmentation masks based on Bboxes, key points, or textual descriptions. The key components of SAM2 are the (i) image encoder, (ii) prompt encoder, (iii) memory mechanism, and (iv) mask decoder.
The image encoder is built on a transformer backbone and is responsible for extracting rich visual features from data. It interprets the scene at each time step, forming the foundation of the segmentation pipeline. The prompt encoder is designated for user instructions or prompts, guiding the model to focus on particular objects or regions of interest (RoIs). The memory mechanism, combined with a memory encoder, a memory bank, and an attention mechanism, is responsible for temporal consistency. This design allows the model to recall the information from the previous frames. The mask decoder generates the segmentation masks by combining visual features and user prompts, producing accurate delineations of the RoI. Although SAM2 incorporates a memory module for temporal consistency in video segmentation, this component was not utilized in our framework.
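A minimal sketch of this box-prompted, image-only pathway is shown below, assuming Meta's sam2 package and a publicly released checkpoint; the image path and box coordinates are illustrative.

```python
# Box-prompted, single-image SAM2 inference; a sketch assuming the sam2 package.
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = np.array(Image.open("strawberry.jpg").convert("RGB"))  # hypothetical file
predictor.set_image(image)              # image encoder produces the embedding

box = np.array([120, 80, 260, 220])     # illustrative XYXY bounding-box prompt
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape)                      # (1, H, W) binary mask for the prompt
```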
2.4. Proposed Hybrid Framework
To obtain pixel-accurate instance masks of ripe strawberries as the target RoI in the FSL and ZSL paradigms, a two-stage pipeline combining a tuned YOLOv11 detector with SAM2 was proposed (Algorithm 1). In this framework, YOLOv11 is responsible for generating the Bbox prompt of the target, and SAM2 extracts the fine-grained mask of the instance inside the Bbox.
Let $\mathcal{I} = \{I_1, \ldots, I_N\}$ be the input images. For each $I_i$, YOLOv11 produces candidate Bboxes $B_i$ (Equation (3)), which are used as prompts for SAM2. SAM2 fuses image features with prompt features and decodes the final segmentation mask $M_i$ (Equation (4)):
$$B_i = f_{\text{det}}(I_i; \theta), \quad (3)$$
$$M_i = f_{\text{seg}}(I_i, B_i; \phi), \quad (4)$$
where $f_{\text{det}}$ denotes the YOLOv11 detector with parameters $\theta$, and $f_{\text{seg}}$ denotes the SAM2 segmenter with parameters $\phi$.
Algorithm 1: YOLO-SAM AgriScan: Bounding-box-prompted ripe strawberry segmentation
Input: image set $\mathcal{I} = \{I_1, \ldots, I_N\}$. Output: segmentation masks $\{M_i\}_{i=1}^{N}$.
1. Step 1: Detection
2. Freeze backbone layers $\theta_b$; train neck and detection head $\theta_h$ (Equation (5))
3. For each image $I_i \in \mathcal{I}$:
4. Predict Bboxes $B_i = f_{\text{det}}(I_i; \theta)$,
5. where $B_i = \{b_1, \ldots, b_m\}$
6. Step 2: Segmentation
7. Encode image: $E_i = \mathrm{Enc}_{\mathrm{img}}(I_i)$
8. Encode prompts (Bboxes): $P_i = \mathrm{Enc}_{\mathrm{prompt}}(B_i)$
9. Decode segmentation masks: $M_i = \mathrm{Dec}(E_i, P_i)$
10. Output the mask set $\{M_i\}_{i=1}^{N}$
11. Return $\{M_i\}_{i=1}^{N}$
In Step 1 of the algorithm, transfer learning was adopted by freezing the low-level backbone parameters $\theta_b$ and training only the task-specific layers $\theta_h$ (neck + detection head) on the dataset, as depicted in Equation (5):
$$\theta = \{\theta_b, \theta_h\}, \qquad \theta_h \leftarrow \theta_h - \eta\, \nabla_{\theta_h} \mathcal{L}_{\text{det}}, \qquad \theta_b \ \text{fixed}. \quad (5)$$
Given $I_i$, YOLOv11 predicts a set of $m$ Bboxes, $B_i = \{b_1, \ldots, b_m\}$, where $b_j = (x_j, y_j, w_j, h_j, s_j, c_j)$. Here, $(x_j, y_j, w_j, h_j)$ are the box parameters, $s_j$ is the confidence score, and $c_j$ is the class (ripe). Non-maximum suppression (NMS) was applied to remove redundant detections. The detector was trained using the standard YOLO losses.
In Step 2, SAM2 receives the image $I_i$ and the bounding boxes $B_i$ as prompts. The image encoder first computes a dense embedding $E_i = \mathrm{Enc}_{\mathrm{img}}(I_i)$. The prompt encoder then maps each bounding-box prompt into a corresponding embedding $P_i = \mathrm{Enc}_{\mathrm{prompt}}(B_i)$. Because our work utilizes only single images and does not involve video sequences, SAM2's memory mechanism was not used; therefore, no recurrent memory state is propagated and no temporal fusion is applied. The mask decoder directly combines the image embedding $E_i$ with the prompt embedding $P_i$ to generate the pixel-level segmentation mask $M_i = \mathrm{Dec}(E_i, P_i)$. The final output is a set of masks $\{M_i\}_{i=1}^{N}$, each delineating ripe strawberries at the pixel level. A schematic of the proposed model is shown in
Figure 3.
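As a concrete illustration of Algorithm 1, the sketch below chains a fine-tuned YOLOv11 detector with the SAM2 image predictor; it assumes the ultralytics and sam2 packages, a public SAM2 checkpoint, and a hypothetical fine-tuned weight file, so it should be read as a sketch of the pipeline rather than the authors' exact implementation.

```python
# Illustrative end-to-end YOLO-SAM AgriScan pipeline (Algorithm 1); not the
# authors' exact code. Assumes the ultralytics and sam2 packages and a locally
# available fine-tuned YOLOv11 weight file (hypothetical path).
import numpy as np
from PIL import Image
from ultralytics import YOLO
from sam2.sam2_image_predictor import SAM2ImagePredictor

detector = YOLO("yolo11m_strawberry.pt")      # hypothetical fine-tuned weights
segmenter = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

def yolo_sam_agriscan(image_path: str, conf: float = 0.25) -> list[np.ndarray]:
    """Return one binary mask per detected ripe strawberry."""
    image = np.array(Image.open(image_path).convert("RGB"))

    # Step 1: Detection -- YOLOv11 predicts Bboxes B_i (NMS is applied internally).
    result = detector.predict(image_path, conf=conf, verbose=False)[0]
    boxes = result.boxes.xyxy.cpu().numpy()   # (m, 4) XYXY boxes

    # Step 2: Segmentation -- each Bbox is used as a prompt for SAM2.
    segmenter.set_image(image)                # image encoder: E_i
    masks = []
    for box in boxes:                         # prompt encoder + mask decoder
        m, _, _ = segmenter.predict(box=box, multimask_output=False)
        masks.append(m[0].astype(bool))       # pixel-level mask M_i
    return masks

# Example usage (hypothetical image path):
# masks = yolo_sam_agriscan("greenhouse_plant_001.jpg")
```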
2.5. Model Training
We selected YOLOv11m and fine-tuned it under the FSL setting on both datasets. The input image size was set to 640 × 640 pixels, and the batch size was eight. Extensive data augmentation, including color shifting, rotation, scaling, and translation, was applied during training to adapt the model to real-world conditions. AdamW was used as the optimizer with a decaying learning rate, and weight decay was applied for regularization. To preserve the pretrained low-level features, the first four backbone layers were kept frozen, whereas the detection head remained trainable to adapt to the characteristics of the strawberry datasets. Early stopping was implemented to prevent overfitting. Training was conducted on a PC with a CUDA-enabled NVIDIA RTX A500 GPU (4 GB VRAM), an Intel Core Ultra 7 165H processor (1.40 GHz), and 32 GB of RAM.
In the first scenario, we used 80% of the dataset for training and evaluated the effect of different training durations by running the model for 5, 10, and 20 epochs. This allowed us to examine how training length influences detection performance. In the second scenario, we applied an FSL setting by training the detector with subsets of 50, 100, and 200 images randomly sampled from the original training split. This setup was designed to assess model performance under limited annotated data. In both scenarios, 20% of the data was consistently used as the test set. Based on these experiments, the best-performing detection configuration in terms of images used and epochs (training with 50 images for 10 epochs) was selected as the detection backbone for the subsequent ZSL stage on both datasets, as detailed in
Table 1.
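The two training scenarios can be scripted with the ultralytics API roughly as follows; the subset YAML files, learning rate, and weight decay values are illustrative assumptions, since the exact values are not reproduced here.

```python
# Illustrative few-shot training scenarios with ultralytics; the subset YAML
# files and hyperparameter values are assumptions, not the study's exact settings.
from ultralytics import YOLO

SUBSET_YAMLS = {
    50: "strawberry_50.yaml",    # hypothetical configs pointing to 50/100/200-image
    100: "strawberry_100.yaml",  # random subsets of the training split
    200: "strawberry_200.yaml",
}

for n_images, data_yaml in SUBSET_YAMLS.items():
    model = YOLO("yolo11m.pt")      # pretrained YOLOv11m checkpoint
    model.train(
        data=data_yaml,
        epochs=10,                  # best-performing epoch budget
        imgsz=640,
        batch=8,
        optimizer="AdamW",
        lr0=1e-3,                   # illustrative; value not given here
        weight_decay=5e-4,          # illustrative
        freeze=4,                   # keep the first four backbone layers frozen
        patience=5,                 # early stopping
        name=f"fsl_{n_images}img",
    )
```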
2.6. Performance Evaluation Metrics
Standard metrics were used to assess the proposed methodology for detection and segmentation. For detection, we employed the standard precision (P) and recall (R) metrics, along with mean average precision (mAP), the standard detection metric that averages precision across classes and IoU thresholds, reflecting both localization accuracy and correct class prediction.
The PR curve plots the precision against the recall at different detection thresholds, providing a visual representation of the model performance. A larger area under the PR curve (AUC-PR) indicates better detection capability.
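As a small illustration, the PR curve and AUC-PR can be computed from per-detection confidences with scikit-learn; the arrays below are toy placeholders, not results from this study.

```python
# Minimal sketch of a PR curve and AUC-PR from detection scores; the arrays
# are toy placeholders, not results from this study.
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0])  # 1 = ripe strawberry present
y_score = np.array([0.92, 0.85, 0.60, 0.58, 0.40, 0.35, 0.20, 0.10])  # confidences

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
auc_pr = auc(recall, precision)              # area under the PR curve
print(f"AUC-PR = {auc_pr:.3f}")
```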
As the segmentation approach follows a zero-shot paradigm, we evaluated the segmentation performance on the test data only, using the Dice similarity coefficient (DSC) and intersection over union (IoU) given in Equations (6) and (7):
$$\mathrm{DSC} = \frac{2\,|A \cap B|}{|A| + |B|}, \quad (6)$$
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}, \quad (7)$$
where $A$ represents the predicted mask and $B$ represents the ground-truth mask.
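Equations (6) and (7) translate directly into a few lines of NumPy for binary masks, as sketched below.

```python
# Direct implementation of Equations (6) and (7) for binary masks.
import numpy as np

def dice_and_iou(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Compute DSC and IoU between a predicted mask A and a ground-truth mask B."""
    a = pred.astype(bool)
    b = gt.astype(bool)
    inter = np.logical_and(a, b).sum()
    dsc = 2.0 * inter / (a.sum() + b.sum() + 1e-9)      # Eq. (6)
    iou = inter / (np.logical_or(a, b).sum() + 1e-9)    # Eq. (7)
    return float(dsc), float(iou)

# Example with toy 2x2 masks:
# dice_and_iou(np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]]))  # -> (0.667, 0.5)
```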
3. Results and Discussion
3.1. Detection Performance
3.1.1. Performance at Different Epochs
Figure 4 displays performance across different training epochs. The analysis focuses on three metrics: recall (R), mean average precision at an IoU threshold of 0.5 (mAP@0.5), and mAP averaged over IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95).
For
, the model demonstrates a consistent improvement when the number of training epochs increases. At 5 epochs, the
R is measured at
, with an
@0.5:0.95 of
. By 10 epochs,
R reached a perfect score; correspondingly, mAP improved substantially, reflecting accurate object localization. At 20 epochs, the model exhibited signs of performance saturation; however, mAP@0.5:0.95 increased further
, suggesting enhanced precision in Bbox localization across stricter IoU thresholds. In contrast, the model’s performance on the
dataset shows a more nuanced trend. Recall initially increases from
at 5 epochs to
at 10 epochs but declines to
at 20 epochs. This decline may indicate overfitting or sensitivity to annotation variability and noise within the dataset. Despite this,
exhibited a steady upward trend, increasing from
to
across the same epoch range, suggesting the continued learning of general object localization patterns. The
metric improved more modestly, from
to
, reflecting incremental gains in fine-grained localization. In addition, the performance was evaluated using
curves and
for different epochs, as shown in
Figure 5. It is evident from the metrics that the model performs slightly better on
than on
, likely due to
’s less complex background.
3.1.2. Performance on Varying Training Sizes
To evaluate the model across different data sizes in the FSL settings, we conducted training with a fixed number of epochs while varying the dataset size. We explored three scenarios using 50, 100, and 200 randomly selected images from the training dataset, with the number of epochs set to 10 for each scenario.
Table 2 displays the performance metrics for the three training-size scenarios, including R, mAP@0.5, and mAP@0.5:0.95.
For dataset , the model performed well even with only 50 training images, achieving a training R value of , an score of , and an score of . On the test set, the performance was slightly lower, with a recall of , of , and of . As the number of training images increased, the test performance consistently improved. With 100 training images, the test recall reached , and with 200 images, it further increased to 0.897. Correspondingly, the improved to , and the increased to . These improvements suggest that the model benefits from additional training data and generalizes more effectively with larger datasets.
A similar pattern was observed for the dataset. With 50 training images, the model achieved a test R of , of , and of . As the training size increases to 100 and then to 200 images, the performance steadily improves. The highest test performance was observed with 200 training images: R of , of , and @0.5:0.95 of . Notably, dataset exhibited stronger test performance at larger training sizes, particularly in terms of .
From
Table 2, we observe that the training results for
are generally lower than those for
, whereas the opposite trend appears on the test set. Although
is visually simpler, its low diversity and repetitive backgrounds limit the model’s ability to learn strong discriminative features under few-shot conditions. In contrast,
contains higher visual variability; therefore, a small randomly selected training subset may not represent the overall distribution and may even include more challenging samples than the test set. This can lead to situations in which the model performs better on the test data than on the limited training subset for
.
Overall, the results demonstrate that the model can achieve high accuracy even with a small number of training samples. Its performance improves as more data are provided, showing good scalability and generalization across datasets. These findings highlight the model’s potential for both few-shot learning scenarios and more robust applications, where larger datasets are available. Based on the above analysis, we selected the FSL model trained for 10 epochs on a 50-image set for the ZSL segmentation pipeline.
Although the 200-image setting achieved the highest detection performance, our ZSL segmentation pipeline is designed for operation under strict annotation constraints. The 50-image, 10-epoch configuration provided the best balance between accuracy, annotation efficiency, and model stability, which aligned with the core objectives of few-shot learning. Notably, this setting already produced strong detection performance (mAP@0.5 > 0.92 on both datasets), which was sufficient for generating reliable bounding-box prompts for the SAM2. Although larger training sets improved the metrics, the additional annotation cost contradicted the motivation of the FSL–ZSL framework. Furthermore, the 10-epoch model consistently avoided underfitting (5 epochs) and overfitting (20 epochs on ), yielding stable predictions that are essential for downstream segmentation.
3.2. Segmentation Performance
To assess the proposed ZSL segmentation performance, the detection backbone was trained on the 50-image training set for 10 epochs on both datasets. On
, the pipeline achieved a mean DSC of
and a mean IoU of
as shown in
Figure 6, indicating a high level of segmentation accuracy. On the
dataset, the model maintained strong performance with a mean DSC of
and a mean IoU of
.
Comparative discussion with state-of-the-art (SOTA) methods:
Table 3 presents a comparison of the introduced model with different SOTA methods for strawberry segmentation. We considered both quantitative metrics and qualitative visual examples for the evaluation.
We compared the proposed model with the baseline UNet [
25], YOLOv8-seg [
26], and YOLOv11-seg [
27], as these models are well known for their strong segmentation ability. Furthermore, we explored the impact of the SAM family, including mobile-SAM and SAM1 alongside SAM2, on the segmentation task. For a fair comparison, all models were evaluated using their original configurations and trained under Scenario 1 (
Table 1) for 20 epochs on both datasets.
Table 3.
Quantitative comparison against SOTA methods on both datasets. Best results are shown in bold. Inference time is reported per image (ms).
| Methods | mIoU | mDice | mIoU | mDice | Inference Time (ms) |
|---|---|---|---|---|---|
| UNet [25] | 0.855 | 0.823 | 0.806 | 0.818 | 145.32 |
| YOLOv8-seg [26] | 0.855 | 0.862 | 0.825 | 0.845 | 123.14 |
| YOLOv11-seg [28] | 0.889 | 0.894 | 0.844 | 0.871 | 94.63 |
| YOLOv8 + SAM1 | 0.930 | 0.915 | 0.911 | 0.890 | 127.07 |
| YOLOv8 + SAM2 | 0.938 | 0.925 | 0.915 | 0.909 | 125.40 |
| YOLOv11 + SAM1 | 0.939 | 0.915 | 0.905 | 0.903 | 88.01 |
| YOLOv11 + mobile-SAM | 0.9510 | 0.9305 | 0.9203 | 0.9122 | 86.60 |
| YOLOv11 + SAM2 | 0.9518 | 0.9310 | 0.9374 | 0.9152 | 80.02 |
| YOLO-SAM AgriScan (Ours) | **0.9519** | **0.9310** | **0.9509** | **0.9213** | **80.02** |
On dataset , the proposed YOLO-SAM AgriScan achieved notable improvements compared to UNet, raising the mIoU from to () and the Dice score from to (). A similar trend is observed for , where the mIoU increases from to () and the Dice from to (). Compared with YOLOv8-seg and YOLOv11-seg, we also observed improved performance of our model. On , the mIoU improves from to , increasing mIoU by , while the Dice score rises from to (). On , the mIoU increases from to () and the Dice score from to (), showing that integrating SAM2 into the proposed architecture substantially refines the segmentation quality beyond detection-driven baselines.
In comparing the proposed approach with models from the SAM family, we focused on the recent YOLOv11-mobile SAM and YOLOv11-SAM models. The YOLOv11-mobile SAM is a lightweight model that demonstrates strong performance, achieving mean intersection over union (mIoU) and Dice coefficients of and 0.9305 on dataset , and and on dataset . However, YOLO-SAM AgriScan showed improvements, particularly on , with an increase of in mIoU and in the Dice score. This indicates that, while the mobile version effectively balances efficiency and accuracy, the YOLO-SAM AgriScan model achieves superior segmentation precision. We then compared YOLO-SAM AgriScan with its closest variant, YOLOv11-SAM, and observed consistent improvements in performance. The performance on is nearly identical for both models; however, on , the mIoU increases from to (a improvement), and the Dice score rises from to (a improvement). Although the differences were smaller, these results confirmed that YOLO-SAM AgriScan enhanced boundary precision and segmentation consistency across both datasets.
Table 3 also reports the inference times of YOLO-SAM AgriScan compared with the other methods under the same computational conditions. For fairness, all the images were resized to 640 × 640 pixels. The proposed model achieves a competitive inference time of 80.02 ms per image, outperforming traditional segmentation models such as UNet (145.32 ms) and YOLOv8-seg (123.14 ms) and remaining faster than hybrid baselines such as YOLOv8 + SAM1 (127.07 ms) and YOLOv8 + SAM2 (125.40 ms). This balance between accuracy and efficiency highlights the suitability of YOLO-SAM AgriScan for real-time or near-real-time agricultural applications.
In Figure 7, we present the qualitative segmentation results for strawberries obtained using the proposed framework on both datasets. YOLO-SAM AgriScan was compared with YOLOv11 + mobile-SAM and YOLOv11 + SAM. Remarkably, the YOLO-SAM AgriScan framework produced segmentation results that closely matched the ground truth. These findings highlight the efficacy of our framework in achieving precise segmentation, with potential applications in automated agricultural analysis without manual annotation.
Figure 8 presents an example from the field-grown dataset in which the proposed method failed to segment the target strawberry (marked with a red circle). This failure occurred because the detection model could not localize the strawberry in the first place. The likely cause is the higher level of occlusion and visual complexity within the field-grown data. In contrast, for our controlled greenhouse dataset, we did not observe any missed segmentations.
3.3. Potential Applications, Limitations, and Future Works
The proposed YOLO-SAM AgriScan framework has substantial potential for advancing precision agriculture. By integrating few-shot detection and zero-shot segmentation, it offers an annotation-efficient approach and reduces dependence on large, manually labeled datasets, a major bottleneck in agricultural AI modeling. The framework can be deployed for real-time fruit monitoring in greenhouses and open-field environments, enabling automated yield estimation, ripeness assessment, and targeted harvesting. Beyond strawberries, this methodology could be adapted to other fruit crops where visual occlusion remains challenging. The scalability of this approach also positions it for integration into digital twin systems, robotic picking platforms, and intelligent phenotyping pipelines that require consistent instance-level segmentation across diverse environmental conditions.
However, this study is not free from limitations. The segmentation pipeline depends on the accuracy of the YOLOv11-generated Bboxes. As a result, misdetections may propagate to the mask-generation stage. Additionally, model validation was limited primarily to ripe strawberries, and its robustness under varying ripeness stages remains to be comprehensively tested. Future research will focus on enhancing cross-stage generalization by incorporating multiclass maturity detection and spatiotemporal modeling to capture growth dynamics. Further improvements could involve integrating vision–language models for self-adaptive prompting, and 3D multimodal perception combining RGB-D inputs for volumetric segmentation.