Article

Prototype-Guided Promptable Retinal Lesion Segmentation from Coarse Annotations

School of Integrated Circuits, Shanghai Jiao Tong University, Shanghai 200240, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(16), 3252; https://doi.org/10.3390/electronics14163252
Submission received: 14 July 2025 / Revised: 9 August 2025 / Accepted: 14 August 2025 / Published: 15 August 2025
(This article belongs to the Special Issue AI-Driven Medical Image/Video Processing)

Abstract

Accurate segmentation of retinal lesions is critical for the diagnosis and management of ophthalmic diseases, but pixel-level annotation is labor-intensive and demanding in clinical scenarios. To address this, we introduce a promptable segmentation approach based on prototype learning that enables precise retinal lesion segmentation from low-cost, coarse annotations. Our framework treats clinician-provided coarse masks (such as ellipses) as prompts to guide the extraction and refinement of lesion and background feature prototypes. A lightweight U-Net backbone fuses image content with spatial priors, while a superpixel-guided prototype weighting module is employed to mitigate background interference within coarse prompts. We simulate coarse prompts from fine-grained masks to train the model, and extensively validate our method across three datasets (IDRiD, DDR, and a private clinical set) with a range of annotation coarseness levels. Experimental results demonstrate that our prototype-based model significantly outperforms fully supervised and non-prototypical promptable baselines, achieving more accurate and robust segmentation, particularly for challenging and variable lesions. The approach exhibits excellent adaptability to unseen data distributions and lesion types, maintaining stable performance even under highly coarse prompts. This work highlights the potential of prompt-driven, prototype-based solutions for efficient and reliable medical image segmentation in practical clinical settings.

1. Introduction

Accurate segmentation of retinal lesions in fundus images is crucial for the diagnosis and management of ocular diseases such as diabetic retinopathy (DR). In recent years, deep learning models, particularly those based on U-Net [1] and its variants [2,3,4], have achieved remarkable, even expert-level, performance in this task. However, the success of these fully supervised approaches critically depends on the availability of large-scale, meticulously annotated datasets in which every lesion pixel is precisely delineated. Constructing such datasets remains a well-recognized bottleneck in medical imaging; it demands substantial manual effort from expert ophthalmologists, resulting in high costs and inefficiencies. This significantly hinders the widespread clinical adoption of these advanced deep learning models. A pragmatic alternative to labor-intensive, pixel-wise annotation is the use of coarse annotations, where clinicians provide simple geometric shapes—such as circles or ellipses—to approximate lesion locations. This approach is over six times more efficient than detailed pixel-level labeling [5]. While coarse prompts effectively capture the core location and approximate scale of lesions, they inevitably lack boundary precision. This limitation raises a critical research question: how can we harness the efficiency of coarse annotations to achieve the segmentation accuracy required for clinical practice?
To address this challenge, we propose a novel promptable segmentation paradigm based on prototype learning for accurate retinal lesion segmentation with low-cost, coarse annotations. As illustrated in Figure 1, in this framework, the provided coarse mask (such as an elliptical contour) is utilized as a prompt to guide the model to segment the actual lesion region. Specifically, our method accepts both a retinal image patch and its corresponding coarse prompt as inputs. By leveraging a lightweight, U-Net-inspired feature extraction backbone, the network effectively integrates image content with the spatial prior embedded in the prompt. Through prototype learning, the model initially computes feature prototypes for both foreground (lesion) and background directly from the prompted region. Final lesion segmentation is then dynamically determined at the pixel level by evaluating the learned similarity between each pixel’s feature and the respective foreground/background prototypes. Because our prototypical approach generates image-specific prototypes that adaptively characterize each image, it exhibits enhanced robustness to intra-class variance and large distribution shifts across different datasets or previously unseen classes.
However, the inherent imprecision of coarse annotations presents a significant challenge for prototype extraction. In the case of small or irregularly shaped lesions, coarse prompts often encompass substantial background areas. When the foreground prototype is generated by simply aggregating features (e.g., through average pooling) within the entire prompted region, a considerable amount of background information is incorporated, resulting in “prototype dilution”. This significantly undermines the accuracy and discriminative power of the lesion prototype. Thus, we introduce an innovative superpixel-guided prototype weighting module. Within the coarse prompt region, this module first employs superpixel segmentation to divide the area into multiple subregions based on local feature similarity, computing a representative sub-prototype for each subregion. Subsequently, the distinctiveness of each sub-prototype from the global background prototype is evaluated, and an adaptive weight is assigned—subregions that are more dissimilar to the background (and therefore more likely to correspond to true lesion) receive higher weights. Finally, all sub-prototypes are fused in a weighted manner to form a foreground prototype that is more robust to background interference and more accurately captures the true characteristics of the lesion.
Our main contributions are as follows:
  • We propose a novel prototype-guided promptable segmentation framework that effectively refines coarse retinal lesion annotations.
  • We introduce a superpixel-based prototype weighting mechanism to counteract prototype dilution and enhance feature discrimination.
  • We conduct comprehensive experiments on three fundus lesion datasets (IDRiD [6], DDR [7], and a private dataset), substantially improving upon the initial coarse annotations and achieving state-of-the-art performance compared to both non-prototypical mask refinement baselines and existing general medical promptable segmentation models.

2. Related Work

2.1. Retinal Lesion Segmentation

Retinal image analysis is crucial for ophthalmic disease diagnosis, monitoring, and treatment planning. Precise segmentation of retinal lesions (e.g., microaneurysms, hemorrhages, exudates, drusen) provides objective evidence for early screening and assessment of blinding diseases such as diabetic retinopathy and age-related macular degeneration [8,9]. Early approaches relied on handcrafted feature engineering and classical image processing methods such as vessel topology extraction and texture analysis, combined with thresholding or region growing. For example, Niemeijer et al. [10] leveraged Gaussian-based matched filtering for microaneurysm detection. However, these methods often fail to generalize robustly due to large variations in lesion scale, blurred boundaries, and morphological diversity.
With advances in deep learning, U-Net and its derivatives have become the main solutions for retinal lesion segmentation. The U-Net [1] encoder–decoder structure with skip connections enables multi-scale feature extraction. For complex lesions, attention mechanisms and feature fusion have been widely explored. Wang et al. [11] introduced a dual-branch attention network, while Zhou et al. [12] utilized hierarchical guidance via attention heat maps. Nonetheless, bottom-up global fusion can suffer from locational inaccuracies for tiny lesions. Liu et al. [2] enhanced small lesion detection via local feature modules, but at the cost of slower convergence for large lesions. Ding et al. [4] combined high-res local cropping and global context, achieving accurate segmentation across diverse scales. Despite these advances, reliance on large expert-annotated datasets remains a bottleneck.

2.2. Prototype Learning

Prototype learning aims to extract discriminative prototypes to represent each class, enabling classification or segmentation via similarity matching [13,14]. This paradigm, intuitively aligned with human reasoning and offering good interpretability, is especially attractive for few-shot and cross-domain medical segmentation [15,16,17]. Zhang et al. [15] proposed self-aware cross-sample prototype learning for semi-supervised segmentation, leveraging limited annotations through dynamic prototype construction. Zhou et al. [16] embedded prototypes within hybrid networks for improved breast tumor segmentation, while Liu et al. [17] used spatial–semantic prototype affinity for pseudo-label optimization in weakly supervised learning, boosting the model’s reliability in ambiguous boundary regions. Inspired by these successes, our work innovatively adapts prototype learning to a prompt-driven framework for refining coarse annotations. We specifically introduce a superpixel-guided mechanism to tackle the unique challenge of prototype dilution caused by imprecise clinical prompts, which has not been addressed by prior work.

2.3. Promptable Segmentation

Promptable segmentation leverages interactive prompts (points, boxes, scribbles, etc.) to dynamically guide the segmentation process. The Segment Anything Model (SAM) [18] established a generic, prompt-driven segmentation framework using a massive pretraining dataset, demonstrating strong zero-shot generalization. In medical imaging, prompt-based segmentation has become an active focus.
Dedicated models have emerged to address medical needs: SAM-Med2D [19] adapts the original SAM for multimodal 2D medical images, while MedSAM [20] scales training with 1.5 million medical segmentations for improved performance. ScribblePrompt [21] introduces flexible 2D prompting modes (point/box/scribble) for various medical tasks. For 3D imaging, SAM-Med3D [22] and NVIDIA VISTA [23] extend promptable segmentation to volumetric data. SegVol [24] integrates 3D spatial prompts and textual descriptions for broad multimodal applications. These developments collectively highlight the evolving landscape of prompt-driven medical image segmentation.

3. Methods

This section details our prototype-based promptable segmentation approach for retinal lesion detection, as illustrated in Figure 2. The core objective is to efficiently transform the coarse mask prompt (e.g., an elliptical annotation) into an accurate pixel-wise segmentation.
To achieve this, our method comprises three main steps. First, a coarse mask prompt simulation procedure (Section 3.1) generates training pairs with simulated coarse prompts from finely annotated data. Second, we construct a prototype learning-based refinement network (Section 3.2), which integrates image features and prompt information to learn discriminative foreground and background prototypes for initial segmentation. Finally, to mitigate background interference within coarse prompts, a superpixel-guided prototype weighting module (Section 3.3) is introduced to enhance foreground prototype discrimination and overall segmentation accuracy.

3.1. Coarse Mask Prompt Simulation

To effectively train the promptable segmentation network, a large volume of retinal image and corresponding coarse mask prompt pairs are required. Direct manual annotation at scale is costly; therefore, we simulate realistic coarse prompts from existing pixel-level precise annotations. This simulation mimics the common clinical practice of marking lesions quickly with simple geometric shapes (e.g., ellipses). The main steps are as follows (illustrated in Figure 3):
1. Preprocessing and Region Dilation: Each binary lesion mask is first dilated using a morphological operation, simulating the looser coverage typical of coarse annotation and merging adjacent small lesions to reduce the number of separate regions.
2. Connected Component Analysis and Centroid Extraction: We perform connected component analysis on the dilated mask to identify all $N_c$ foreground regions. For each region, the centroid coordinate is calculated as an input for clustering.
3. Centroid-based Clustering: Since lesions often cluster spatially, we apply KMeans clustering to the centroids to merge nearby components, reflecting how ophthalmologists may annotate spatially adjacent lesions as a single region. The cluster number is dynamically determined as

$$N_{\mathrm{cluster}} = \min\!\left(\frac{N_c}{r},\; N_{\mathrm{max\text{-}cluster}}\right)$$

where $r$ is a reduction factor and $N_{\mathrm{max\text{-}cluster}}$ is the maximal cluster number.
4. Ellipse Fitting: For each cluster, we fit a minimum-area enclosing ellipse, then randomly scale its axes by 10% to 50% to further mimic the imprecision of coarse clinical annotation.
5. Training Pair Construction: Based on each fitted ellipse, a fixed-size image patch is cropped from the original fundus image with a matching binary mask, yielding the final “image patch/coarse mask prompt” pairs for training.
This automated pipeline efficiently converts fine-grained pixel-level datasets into abundant, realistic training pairs with simulated coarse prompts, providing a reliable foundation for prompt-based segmentation research. The generated prompts closely align with the morphological features of rapid human annotation while incurring minimal additional cost.
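To make the pipeline concrete, the following Python sketch implements the five steps under stated assumptions: OpenCV and scikit-learn are used in place of the authors' unreleased code, cv2.fitEllipse stands in for the minimum-area enclosing ellipse, and the dilation kernel size, reduction factor r, and axis-scaling range are illustrative parameters only.

```python
# Illustrative sketch of the coarse-prompt simulation pipeline (Section 3.1).
# OpenCV/scikit-learn usage, the dilation kernel size, and cv2.fitEllipse as a
# stand-in for the minimum-area enclosing ellipse are assumptions, not the
# authors' released code.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def simulate_coarse_prompts(lesion_mask, r=1.0, n_max_cluster=10,
                            dilate_px=5, scale_range=(1.1, 1.5)):
    """lesion_mask: binary HxW array with 1 at lesion pixels."""
    # Step 1: dilation loosens coverage and merges adjacent small lesions.
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    dilated = cv2.dilate(lesion_mask.astype(np.uint8), kernel)

    # Step 2: connected components and their centroids.
    _, labels, _, centroids = cv2.connectedComponentsWithStats(dilated)
    centroids = centroids[1:]                      # drop the background component
    n_c = len(centroids)
    if n_c == 0:
        return []

    # Step 3: KMeans over centroids; cluster count follows min(N_c / r, N_max-cluster).
    n_cluster = max(1, min(int(n_c / r), n_max_cluster, n_c))
    cluster_ids = KMeans(n_clusters=n_cluster, n_init=10).fit_predict(centroids)

    prompts = []
    for k in range(n_cluster):
        # Gather dilated foreground pixels of the components in this cluster.
        member_ccs = np.where(cluster_ids == k)[0] + 1
        pts = np.argwhere(np.isin(labels, member_ccs))[:, ::-1]   # (x, y) order
        if len(pts) < 5:                           # fitEllipse needs >= 5 points
            continue
        # Step 4: ellipse fitting with random 10%-50% axis enlargement.
        (cx, cy), (w, h), angle = cv2.fitEllipse(pts.astype(np.float32))
        s = np.random.uniform(*scale_range)
        prompt = np.zeros_like(dilated)
        cv2.ellipse(prompt, ((cx, cy), (w * s, h * s), angle), 1, thickness=-1)
        prompts.append(prompt)
    # Step 5: each prompt is later paired with a fixed-size crop of the fundus image.
    return prompts
```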

3.2. Prototype-Based Coarse Annotation Refinement

3.2.1. Feature Fusion

Given paired retinal image patches $I \in \mathbb{R}^{H \times W \times 3}$ and the corresponding coarse lesion mask $M \in \mathbb{R}^{H \times W \times 1}$, we concatenate them along the channel dimension to obtain an $H \times W \times 4$ input. A lightweight U-Net [1] backbone (following [25]) extracts features, with reduced upsampling layers to lower computation. The output deep feature $F \in \mathbb{R}^{H_0 \times W_0 \times C}$ ($H_0 = H/4$, $W_0 = W/4$) is concatenated with a resized mask $M' \in \mathbb{R}^{H_0 \times W_0 \times 1}$ to form an intermediate feature of size $H_0 \times W_0 \times (C+1)$. This is followed by a lightweight fusion module: a $1 \times 1$ convolution (reducing channels to $C_0$), batch normalization, and ReLU activation, resulting in a fused feature $F' \in \mathbb{R}^{H_0 \times W_0 \times C_0}$. This design jointly encodes both local detail (e.g., fine vessel/lesion structure) and global guidance (prompt region), facilitating downstream prototype extraction.
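A minimal PyTorch sketch of this fusion step is given below. The channel sizes, the stride-4 stand-in encoder used when no U-Net backbone is supplied, and the nearest-neighbor mask resizing are our assumptions for illustration, not the authors' exact implementation.

```python
# Minimal PyTorch sketch of the feature-fusion module (Section 3.2.1). The
# stride-4 stand-in encoder and the channel sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptFusion(nn.Module):
    def __init__(self, backbone=None, feat_ch=64, fused_ch=64):
        super().__init__()
        # Replace this stand-in with the lightweight U-Net backbone (4-channel input).
        self.backbone = backbone or nn.Sequential(
            nn.Conv2d(4, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Sequential(                 # 1x1 conv -> BN -> ReLU
            nn.Conv2d(feat_ch + 1, fused_ch, kernel_size=1),
            nn.BatchNorm2d(fused_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, image, prompt_mask):
        # image: B x 3 x H x W, prompt_mask: B x 1 x H x W (binary ellipse prompt)
        x = torch.cat([image, prompt_mask], dim=1)              # B x 4 x H x W
        feat = self.backbone(x)                                 # B x C x H/4 x W/4
        mask_small = F.interpolate(prompt_mask, size=feat.shape[-2:], mode="nearest")
        fused = self.fuse(torch.cat([feat, mask_small], dim=1)) # B x C0 x H/4 x W/4
        return fused, mask_small
```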

3.2.2. Prototype Extraction

Prototype learning is adopted to capture discriminative features for foreground (lesion) and background. First, the foreground prototype $\mu_1$ is computed as the mean of the fused features within the (resized) prompt mask:

$$\mu_1 = \frac{\sum_{(x,y)} F'(x,y)\,\mathbb{I}[M'(x,y)=1]}{\sum_{(x,y)} \mathbb{I}[M'(x,y)=1]}$$

The background prototype $\mu_0$ is calculated analogously over the region where $M'(x,y)=0$:

$$\mu_0 = \frac{\sum_{(x,y)} F'(x,y)\,\mathbb{I}[M'(x,y)=0]}{\sum_{(x,y)} \mathbb{I}[M'(x,y)=0]}$$
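The two prototype equations amount to masked average pooling over the fused feature map; a short sketch is shown below, assuming batched tensors and a binary prompt mask already resized to the feature resolution.

```python
# Masked average pooling implementing the foreground/background prototypes above.
import torch

def extract_prototypes(fused, mask, eps=1e-6):
    # fused: B x C x h x w, mask: B x 1 x h x w with values in {0, 1}
    mu_fg = (fused * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + eps)
    mu_bg = (fused * (1 - mask)).sum(dim=(2, 3)) / ((1 - mask).sum(dim=(2, 3)) + eps)
    return mu_fg, mu_bg          # each of shape B x C
```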

3.2.3. Prototype-Based Segmentation

For each position $(x,y)$ in $F'$, we compute its distance to both prototypes and convert it into a probability map $P_c$ ($c \in \{0,1\}$, background/foreground) via softmax normalization:

$$P_c(x,y) = \frac{\exp\!\big(-\alpha \cdot d(F'(x,y), \mu_c)\big)}{\sum_{c' \in \{0,1\}} \exp\!\big(-\alpha \cdot d(F'(x,y), \mu_{c'})\big)}$$

where $d(\cdot,\cdot)$ is a distance metric (e.g., Euclidean) and $\alpha$ is a scaling factor (fixed to 20) to sharpen the probability distribution.
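A sketch of this matching step is shown below, using Euclidean distance and $\alpha = 20$ as in the text; the batched tensor layout is an assumption for illustration.

```python
# Prototype matching: softmax over negative scaled distances to each prototype.
import torch
import torch.nn.functional as F

def prototype_segmentation(fused, mu_fg, mu_bg, alpha=20.0):
    # fused: B x C x h x w, mu_fg / mu_bg: B x C
    b, c, h, w = fused.shape
    protos = torch.stack([mu_bg, mu_fg], dim=1)            # B x 2 x C
    d = torch.cdist(fused.flatten(2).transpose(1, 2),      # B x hw x C
                    protos)                                # B x hw x 2 distances
    prob = F.softmax(-alpha * d, dim=-1)                   # closer -> higher probability
    return prob[..., 1].view(b, 1, h, w)                   # foreground map P_1
```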

3.2.4. Training and Inference

The model is trained end-to-end with a composite loss:
$$\mathcal{L}_{\mathrm{loss}} = \mathcal{L}_{\mathrm{dice}} + \mathcal{L}_{\mathrm{bce}}$$

where $\mathcal{L}_{\mathrm{dice}}$ and $\mathcal{L}_{\mathrm{bce}}$ denote the Dice and binary cross-entropy losses between the predicted foreground probability map $P_1$ and the ground truth $M_{gt}$.
At inference, each input patch is predicted as a foreground probability map. For patches from overlapping regions in the original image, the final probability is taken as the maximum across all overlapping predictions. Applying a fixed threshold yields the final binary lesion mask.
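The sketch below illustrates the composite loss and the max-fusion rule for overlapping patches; the soft Dice formulation with a smoothing constant and the patch-placement helper are assumptions, not the authors' code.

```python
# Composite Dice + BCE loss and max-fusion of overlapping patch predictions.
import torch
import torch.nn.functional as F

def dice_bce_loss(pred, target, smooth=1.0):
    # pred: B x 1 x h x w foreground probabilities, target: float binary mask, same shape
    bce = F.binary_cross_entropy(pred, target)
    inter = (pred * target).sum(dim=(1, 2, 3))
    dice = 1 - (2 * inter + smooth) / (pred.sum(dim=(1, 2, 3))
                                       + target.sum(dim=(1, 2, 3)) + smooth)
    return dice.mean() + bce

def paste_patch_max(full_prob, patch_prob, y0, x0):
    # Keep the per-pixel maximum across all patches overlapping this location.
    h, w = patch_prob.shape
    full_prob[y0:y0 + h, x0:x0 + w] = torch.maximum(
        full_prob[y0:y0 + h, x0:x0 + w], patch_prob)
    return full_prob
```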

3.3. Superpixel-Guided Prototype Weighting

As shown in Figure 4, when the true lesion area is much smaller than the coarse mask, the foreground prototype computed via Equation (2) is easily contaminated by background pixels. To address this, we introduce a superpixel-guided foreground prototype weighting strategy to alleviate background interference within the prompt region.
Inspired by the maskSLIC algorithm [26], we first partition the prompt region into multiple superpixels based on feature similarity. For each superpixel $S_i$, a sub-prototype vector $g_i$ is extracted as in Equation (2). All sub-prototypes form the set $G = \{g_i\}$, $i = 1, 2, \ldots, N_{sp}$, where $N_{sp}$ is the number of superpixels. To adaptively determine $N_{sp}$ based on the prompt's scale, we set

$$N_{sp} = \min\!\left(\frac{\sum_{(x,y)} M(x,y)}{S_{sp}},\; N_{\mathrm{max\text{-}sp}}\right)$$

where $\sum_{(x,y)} M(x,y)$ is the pixel count of the prompt mask, $S_{sp}$ is the expected area per superpixel (empirically set to 100), and $N_{\mathrm{max\text{-}sp}}$ is capped at 20 to keep computation tractable.

To assess the distinctiveness of each sub-prototype from the background, we compute the cosine distance between each $g_i$ and the background prototype $\mu_0$:

$$w_i = 1 - \frac{g_i \cdot \mu_0}{\lVert g_i \rVert\,\lVert \mu_0 \rVert}$$

where $\lVert \cdot \rVert$ denotes the L2 norm. The weights are then normalized via softmax:

$$\tilde{w}_i = \frac{\exp(\beta \cdot w_i)}{\sum_{g_j \in G} \exp(\beta \cdot w_j)}$$

Here, we follow [27,28] to set the scaling factor $\beta = 10$ to sharpen the weighting effect.

The final weighted foreground prototype is computed as

$$\tilde{\mu}_1 = \sum_{g_i \in G} \tilde{w}_i \cdot g_i$$

In subsequent segmentation computations, $\tilde{\mu}_1$ replaces the original $\mu_1$ in Equation (4). Experimental results demonstrate that this superpixel-guided weighted prototype more accurately captures discriminative lesion characteristics, especially in cases where the coarse prompt severely overestimates lesion size, thus significantly improving segmentation performance.
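The following sketch outlines the superpixel-guided weighting described above. It relies on scikit-image's SLIC with a mask argument as a stand-in for maskSLIC, operates on a single sample's fused feature map, and uses an illustrative compactness value; these choices are assumptions rather than the authors' implementation.

```python
# Sketch of superpixel-guided prototype weighting (single sample). scikit-image's
# SLIC with a mask argument is used as a maskSLIC stand-in; compactness and the
# choice to run SLIC directly on the fused feature map are assumptions.
import numpy as np
import torch
import torch.nn.functional as F
from skimage.segmentation import slic   # requires a recent scikit-image

def weighted_fg_prototype(fused, mask, mu_bg, s_sp=100, n_max_sp=20, beta=10.0):
    # fused: C x h x w, mask: h x w binary prompt, mu_bg: C
    n_sp = int(min(max(mask.sum().item() / s_sp, 1), n_max_sp))
    feat_np = fused.permute(1, 2, 0).detach().cpu().numpy()        # h x w x C
    labels = slic(feat_np, n_segments=n_sp, compactness=0.1, start_label=1,
                  mask=mask.detach().cpu().numpy().astype(bool), channel_axis=-1)

    sub_protos, dists = [], []
    for lab in np.unique(labels[labels > 0]):
        region = torch.from_numpy(labels == lab).to(fused.device)
        g_i = fused[:, region].mean(dim=1)                         # sub-prototype g_i
        dists.append(1 - F.cosine_similarity(g_i, mu_bg, dim=0))   # cosine distance to mu_0
        sub_protos.append(g_i)

    sub_protos = torch.stack(sub_protos)                           # N_sp x C
    weights = F.softmax(beta * torch.stack(dists), dim=0)          # softmax with beta = 10
    return (weights.unsqueeze(1) * sub_protos).sum(dim=0)          # weighted foreground prototype
```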

4. Experiments and Results

4.1. Datasets and Evaluation Metrics

4.1.1. Datasets

We conducted comprehensive evaluations on three datasets, including two public retinal lesion datasets (IDRiD [6] and DDR [7]) and one private clinical dataset:
  • IDRiD: Comprises 81 color fundus images (54 for training, 27 for testing) with a resolution of 4288 × 2848 pixels. Pixel-level annotations cover four lesion types associated with diabetic retinopathy: hard exudates (EXs), hemorrhages (HEs), microaneurysms (MAs), and soft exudates (SEs).
  • DDR: Consists of 757 fundus images (383 for training, 149 for validation, 225 for testing), with resolutions ranging from 1088 × 1920 to 3456 × 5184 pixels. Lesion classes are the same as in IDRiD.
  • Private Clinical Dataset: Includes 211 color fundus images with pixel-level annotations of preretinal hemorrhage (Prh) and drusen, labeled by two experienced ophthalmologists.
The overall model training is conducted using the training split of the IDRiD dataset, which contains 54 fundus images. Applying the coarse mask prompt simulation algorithm described in Section 3.1, a total of 3320 valid training image–mask pairs are generated from these images. These pairs comprehensively cover all four key diabetic retinopathy lesion types: hard exudates (EXs, 30.2% of samples), hemorrhages (HEs, 23.5%), microaneurysms (MAs, 43.6%), and soft exudates (SEs, 2.7%). To address the scarcity of SE lesions, an oversampling strategy is employed to increase their effective representation in the training set, thereby enhancing the model’s sensitivity and learning capacity for this rare category.
For evaluation, the 27 IDRiD test images, 225 DDR test images, and all images from the private dataset are preprocessed using the same mask prompt simulation pipeline as in training. This standardized preprocessing ensures consistency between training and testing data, enhances comparability across datasets, and enables a fair assessment of the model’s capability in refining coarse fundus lesion annotations.

4.1.2. Evaluation Metrics

Following previous work [2,3,4], the Intersection over Union (IoU) is adopted as the primary evaluation metric for lesion segmentation performance in fundus images.
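For reference, IoU on binary masks can be computed as in the following minimal snippet.

```python
# Minimal reference implementation of IoU on binary masks.
import numpy as np

def iou(pred, gt, eps=1e-6):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)
```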

4.2. Implementation Details

All experiments were conducted using the Adam optimizer with a batch size of 64, an initial learning rate of $1 \times 10^{-4}$, and a weight decay of $1 \times 10^{-4}$. The model was trained for 120 epochs. Learning rate scheduling was handled by the ReduceLROnPlateau strategy to adaptively adjust the learning rate for better convergence and performance.
To enhance generalization, various data augmentation techniques were applied during training, including random translation, scaling, rotation, blur, and random adjustments of brightness and contrast. The maximum number of superpixels $N_{\mathrm{max\text{-}sp}}$ was fixed at 20. All input image–prompt pairs were resized to 256 × 256 before being fed into the model.
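The optimizer and scheduler setup described above can be expressed as in the sketch below; the stand-in model, the scheduler's factor and patience, and the dummy epoch loss are placeholders rather than reported settings.

```python
# Optimizer/scheduler setup matching Section 4.2 (Adam, lr 1e-4, weight decay 1e-4,
# 120 epochs, ReduceLROnPlateau). Factor/patience values are assumptions.
import torch
import torch.nn as nn

model = nn.Conv2d(4, 1, kernel_size=1)          # stand-in for the actual network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)

for epoch in range(120):
    epoch_loss = float(torch.rand(1))           # replace with the real training loss
    scheduler.step(epoch_loss)                  # reduce the LR when the loss plateaus
```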

4.3. Comparison Methods

To comprehensively evaluate the proposed method for coarse lesion annotation refinement, we conducted extensive comparative experiments covering both conventional fully supervised approaches and the latest promptable segmentation techniques.
Firstly, we benchmarked our model against several state-of-the-art fully supervised fundus lesion segmentation methods [2,4], which do not utilize any mask prompt information. These serve as baselines to assess the achievable segmentation performance without auxiliary prompt guidance.
Secondly, we implemented multiple non-prototypical segmentation models as baseline promptable methods. These models take both the image patch and its corresponding mask prompt as input, extract features via a prompt fusion module (identical to Section 3.2 except for the backbone choice), and employ a 1 × 1 convolution as the segmentation head to generate refined probability maps. For a thorough evaluation, three popular backbone networks were explored: ResNet-18 [29], HRNet-18 [30], and U-Net [1]. This diversity allows us to investigate the impact of backbone architectures on segmentation performance.
Lastly, we compared our approach to 2D medical promptable segmentation models, specifically SAM-Med2D [19] and MedSAM [20]. Since these models do not support elliptical prompt masks, we used the lesion’s bounding box as the input prompt for a fair comparison and evaluation of segmentation performance.

4.4. Results on Each Dataset

4.4.1. Results on IDRiD Dataset

We first train all segmentation models on the IDRiD training set, covering all four major lesion types (MA, SE, EX, HE), and evaluate their coarse prompt refinement performance on the IDRiD test set. The quantitative results are shown in Table 1. Several key observations are summarized as follows:
  • Both our prototype-based promptable segmentation method and other prompt-based baselines significantly outperform the initial coarse mask prompt across all lesion categories, demonstrating substantial improvement in segmentation accuracy.
  • Without prompt guidance, traditional fully supervised approaches (e.g., HRDecoder) achieve only 32.6% IoU on the highly challenging MA class, much lower than for other lesions. This highlights the difficulty of detecting tiny lesions and the effectiveness of spatial prompts in enhancing detection.
  • Our method outperforms non-prototypical baselines for all four lesion types. For example, our method achieves 84.3% IoU for MA, surpassing HRNet18 (79.1%) and U-Net (77.6%). On average, the prototype-based approach yields an improvement of over 5.2% for all lesions.
  • General-purpose medical promptable models such as MedSAM and SAM-Med2D perform less effectively on fundus lesion segmentation. For instance, on EX, MedSAM achieves only 56.0% IoU, whereas our approach attains 66.9%. This suggests that such models are not specifically optimized for the unique challenges of fundus lesion segmentation, notably the fine-grained, tiny, and diffusive lesion patterns with low contrast and blurred boundaries.
  • The superpixel-guided prototype weighting module delivers significant performance gains for block-like, large-area lesions (e.g., HEs), but has a limited effect for spot-like micro-lesions (e.g., MAs). For small lesions, the prompt mask is typically small and may default to a single prototype, while larger lesions benefit from multiple superpixels and effective weighting, resulting in more accurate prototypes and improved segmentation.
Table 1. Comparisons with other methods on the IDRiD dataset. For overall comparison, the best results are highlighted in bold, the second-best are underlined, and our results are displayed in gray. All values are IoU (%) ↑.

Methods | MA | SE | EX | HE | Mean
Coarse Mask Prompt | 9.6 ± 2.6 | 49.3 ± 10.4 | 15.0 ± 5.2 | 33.2 ± 8.6 | 26.8 ± 6.7
M2MRF [2] | 31.4 ± 8.3 | 52.3 ± 12.1 | 64.6 ± 9.6 | 48.5 ± 11.2 | 49.2 ± 10.3
HRDecoder [4] | 32.6 ± 7.9 | 56.1 ± 14.5 | 65.0 ± 10.1 | 49.8 ± 10.9 | 50.9 ± 10.9
ResNet18 [29] | 73.9 ± 7.3 | 68.4 ± 14.2 | 54.1 ± 9.4 | 62.6 ± 10.9 | 64.7 ± 10.5
HRNet18 [30] | 79.1 ± 7.5 | 78.1 ± 5.3 | 56.8 ± 9.1 | 64.8 ± 11.8 | 69.7 ± 8.4
U-Net [1] | 77.6 ± 6.8 | 75.6 ± 12.6 | 58.9 ± 8.9 | 67.9 ± 11.4 | 70.0 ± 9.9
MedSAM [20] | 60.7 ± 7.1 | 59.6 ± 11.3 | 56.0 ± 13.5 | 62.2 ± 9.7 | 59.6 ± 10.4
SAM-Med2D [19] | 57.9 ± 7.6 | 61.1 ± 13.2 | 51.2 ± 14.7 | 65.9 ± 11.0 | 59.0 ± 11.6
Ours w/o superpixel | 84.2 ± 6.2 | 80.7 ± 9.2 | 65.3 ± 8.1 | 69.9 ± 13.9 | 75.0 ± 9.4
Ours | 84.3 ± 6.1 | 81.4 ± 9.6 | 66.7 ± 8.4 | 72.0 ± 11.3 | 76.1 ± 9.4

4.4.2. Cross-Dataset and Cross-Class Results

To further verify model generalization, we directly evaluate the model trained on the IDRiD dataset on the DDR test set and our private clinical dataset (containing two novel lesion types: Drusen and Prh) without any fine-tuning. Results are shown in Table 2 and Table 3, with key findings summarized below.
On the DDR dataset, cross-dataset evaluation reveals that fully supervised baselines (M2MRF, HRDecoder) experience a sharp drop in mean IoU (to 29.0% and 29.6%, respectively, over 20% lower than on IDRiD), indicating strong sensitivity to data distribution shift. Our method achieves a mean IoU of 66.3%, outperforming HRNet18 (61.5%) and U-Net (61.6%) by 4.8% and 4.7%, with only a 10.2% drop from IDRiD. This demonstrates much greater stability under domain shift, attributable to per-instance prototype modeling of lesion semantics.
On the private dataset, cross-class evaluation shows our method achieves 49.6% IoU for Drusen and 62.7% for Prh, significantly surpassing MedSAM (40.1%, 50.7%) and U-Net (45.2%, 57.9%). The superpixel-guided prototype weighting module yields a notable improvement for the irregular, large-area Prh class, with a 6.4% performance gain over the global prototype baseline, highlighting its effectiveness.
Overall, the results demonstrate that our promptable segmentation network not only achieves excellent performance on in-distribution data, but also maintains robust accuracy when applied to new datasets and previously unseen lesion types, confirming its strong generalization capabilities.

4.4.3. Qualitative Analysis

Figure 5 presents qualitative comparisons that intuitively illustrate the differences and characteristics of various segmentation methods across lesion types. In the first two rows, the non-prototypical U-Net baseline exhibits prominent boundary discontinuities and fragmented mask predictions for Prh. Models using only global prototypes achieve more contiguous results, but tend to over-segment, failing to accurately fit real lesion contours. In contrast, our superpixel-guided prototype weighting method produces more precise boundaries, demonstrating superiority for complex lesion shapes and blurred edges. It is also worth noting that for some large, diffuse lesions like the Prh example (third row), our method, while more contiguous than the U-Net baseline, still exhibits some over-segmentation compared to the ground truth. This may occur because the true boundaries of such lesions are inherently ambiguous, even for human experts.
The last row of Figure 5 shows a challenging case from the private dataset, where the input prompt covers both a familiar lesion (EX) seen in training, and an unseen lesion (Drusen). All compared methods, including ours, predominantly segment the familiar EX region, while often ignoring or misclassifying the previously unseen Drusen area. This highlights the inherent bias of models toward frequent lesion types in the training data and points to areas for future improvement in handling rare or novel lesions.

4.4.4. Sensitivity Analysis of Maximum Number of Superpixels

To evaluate the impact of the maximum number of superpixels, $N_{\mathrm{max\text{-}sp}}$, on our model’s performance, we conducted a sensitivity analysis on the IDRiD dataset. The results are presented in Table 4. The analysis reveals that the superpixel-guided weighting module consistently enhances segmentation accuracy. When $N_{\mathrm{max\text{-}sp}}$ is set to 1, the entire prompt region is treated as a single superpixel, which is equivalent to our model without the weighting mechanism. As $N_{\mathrm{max\text{-}sp}}$ increases from 1 to 20, the mean IoU steadily improves from 75.0% to 76.1%, demonstrating the benefit of partitioning the coarse prompt region to refine the foreground prototype. However, the performance gains begin to saturate at higher values. Increasing $N_{\mathrm{max\text{-}sp}}$ from 20 to 30 yields only a marginal improvement of 0.1% in mean IoU (from 76.1% to 76.2%). This suggests that subdividing the prompt into more than 20 superpixels provides diminishing returns. Therefore, we chose $N_{\mathrm{max\text{-}sp}} = 20$ for our main experiments, as it offers an excellent trade-off between segmentation performance and computational efficiency. This analysis confirms that our model is robust to the choice of this hyperparameter within a reasonable range.

4.4.5. Generalization to Box Prompt

Despite our focus on elliptical prompts, the model’s architecture is inherently flexible. Technically, our model processes any geometric prompt by first converting it into a binary mask for feature fusion. To explicitly validate this, we conducted an experiment using box prompts. We modified our coarse mask simulation pipeline by replacing the “Ellipse Fitting” step with a “Bounding Box Fitting” step and retrained the model. The results, presented in Table 5, demonstrate two key findings. First, the initial IoU of the coarse elliptical prompts is consistently higher than that of box prompts across all datasets. This quantitatively supports our observation that ellipses offer a better initial fit for the contours of lesion clusters compared to bounding boxes. Second, while our model achieves peak performance with elliptical prompts, it remains highly effective with box prompts, showing only a minor performance decrease of approximately 2–3%. For instance, on the IDRiD dataset, the model still improves the IoU from a coarse 23.4% to a refined 74.0%. This confirms that our framework is robust and can generalize effectively to different prompt shapes, even if they are less optimal than ellipses.

4.4.6. Effect of Prompt Coarseness on Segmentation Performance

In clinical scenarios, ophthalmologists often employ coarse elliptic annotations to quickly cover multiple, disconnected lesion regions, introducing varying degrees of annotation coarseness. To systematically study how this affects segmentation performance, we control the level of prompt coarseness in our simulation pipeline by adjusting the K-Means reduction factor r. As r increases, the number of generated elliptical mask prompts per image decreases, and each prompt tends to cover a larger area—potentially including more lesions and greater background, thus increasing prompt coarseness. As shown in Figure 6, larger r values result in fewer, broader prompts and reduced prompt precision.
We evaluate the proposed method (with and without superpixel guidance) and three non-prototypical baselines (U-Net, HRNet18, ResNet18) on IDRiD, DDR, and the private dataset across a range of r values (from 1.0 to 2.0, step 0.1). Results in Figure 7 indicate that, while all methods experience some performance drop as prompt coarseness increases, our prototype learning framework exhibits remarkable robustness. Even under highly coarse prompts ( r = 2.0 ), our method maintains high IoU scores (over 50% on IDRiD and private, and around 50% on DDR), consistently outperforming the U-Net baseline by 8–11 percentage points. These findings demonstrate the strong adaptability and superior segmentation performance of our framework—particularly with superpixel-guided prototype weighting—under varying levels of prompt quality.

5. Discussion and Conclusions

In this study, we proposed a prototype-guided promptable segmentation framework for refining coarse retinal lesion annotations into accurate pixel-level masks. By leveraging prototype learning and superpixel-guided weighting, our method can adaptively model the complex feature distributions of retinal lesions using only coarse spatial prompts such as ellipses—substantially reducing the annotation burden in clinical workflows. Through systematic experiments on three datasets (IDRiD, DDR, and a private clinical set), we demonstrated that the proposed approach consistently outperforms fully supervised networks and non-prototypical prompt-based baselines, with superior robustness to prompt coarseness, cross-dataset domain shift, and previously unseen lesion categories. Notably, the superpixel-guided prototype weighting module proved especially valuable for segmenting large and irregular lesion regions, further highlighting the flexibility and adaptability of the prototype-based framework.
For practical deployment, our framework could be integrated into existing Picture Archiving and Communication Systems (PACSs) or ophthalmology imaging software as a segmentation-assistance tool. The clinical workflow would involve an ophthalmologist drawing a quick, coarse ellipse around a region of interest, which our model would then process in near real-time to generate a refined, pixel-level mask for quantitative analysis or disease monitoring. However, the integration of such AI tools necessitates a discussion of potential risks. It is essential to mitigate the risk of automation bias, where clinicians might over-rely on the model’s output. Therefore, the proposed tool should be implemented as a decision-support system that assists, rather than replaces, expert judgment, and it must include intuitive interfaces for clinicians to easily review and edit the AI-generated segmentations.
While this study demonstrates strong technical performance, we acknowledge its limitations, which present avenues for future research. First, our claim of reducing annotation costs is based on established literature rather than a direct time-saving analysis conducted in a clinical setting. Future work should include a formal usability study to quantify the precise efficiency gains our method provides to ophthalmologists. Second, although the model performed well on rare classes like soft exudates (SEs), performance on underrepresented classes can always be improved. Future iterations of this work will explore advanced strategies beyond oversampling, such as implementing class-weighted loss functions or using generative models (GANs) to synthesize realistic training data for rare lesions, thereby further enhancing model robustness.
In summary, this work underscores the potential of prototype-based promptable segmentation to bridge the gap between annotation efficiency and clinical accuracy. Our findings highlight a promising direction for developing scalable, adaptable, and reliable medical image analysis tools that can be effectively integrated into modern clinical workflows.

Author Contributions

Conceptualization, Q.Y.; methodology, Q.Y.; software, Q.Y.; validation, Q.Y.; formal analysis, Q.Y.; resources, X.D.; data curation, Q.Y.; writing—original draft preparation, Q.Y.; writing—review and editing, X.D.; visualization, Q.Y.; supervision, X.D.; project administration, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study involving human subjects was approved by the Zibo Central Hospital Ethics Committee (Application No. 202102002) on 10 February 2021.

Informed Consent Statement

Informed consent for participation was obtained from all subjects involved in the study.

Data Availability Statement

The public datasets used in this study are available at https://idrid.grand-challenge.org/Data (accessed on 12 March 2024, IDRiD [6]) and https://github.com/nkicsl/DDR-dataset (accessed on 12 March 2024, DDR [7]). The private clinical dataset contains sensitive patient information and cannot be made publicly available due to privacy regulations and institutional policies.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SAM: Segment Anything Model
DR: Diabetic Retinopathy
EX: Hard Exudates
SE: Soft Exudates
MA: Microaneurysms
Prh: Preretinal Hemorrhage
HE: Hemorrhages
IoU: Intersection over Union
PACS: Picture Archiving and Communication System

References

  1. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  2. Liu, Q.; Liu, H.; Ke, W.; Liang, Y. Automated lesion segmentation in fundus images with many-to-many reassembly of features. Pattern Recognit. 2023, 136, 109191. [Google Scholar] [CrossRef]
  3. Huang, S.; Li, J.; Xiao, Y.; Shen, N.; Xu, T. RTNet: Relation Transformer Network for Diabetic Retinopathy Multi-lesion Segmentation. IEEE Trans. Med. Imaging 2022, 41, 1596–1607. [Google Scholar] [CrossRef] [PubMed]
  4. Ding, Z.; Liang, Y.; Kan, S.; Liu, Q. HRDecoder: High-Resolution Decoder Network for Fundus Image Lesion Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 6–10 October 2024; Springer: Cham, Switzerland, 2024; pp. 328–338. [Google Scholar]
  5. Huang, Y.; Lin, L.; Li, M.; Wu, J.; Cheng, P.; Wang, K.; Yuan, J.; Tang, X. Automated hemorrhage detection from coarsely annotated fundus images in diabetic retinopathy. In Proceedings of the IEEE International Symposium on Biomedical Imaging, Iowa City, IA, USA, 3–7 April 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1369–1372. [Google Scholar]
  6. Porwal, P.; Pachade, S.; Kokare, M.; Deshmukh, G.; Son, J.; Bae, W.; Liu, L.; Wang, J.; Liu, X.; Gao, L.; et al. Idrid: Diabetic retinopathy–segmentation and grading challenge. Med. Image Anal. 2020, 59, 101561. [Google Scholar] [CrossRef] [PubMed]
  7. Li, T.; Gao, Y.; Wang, K.; Guo, S.; Liu, H.; Kang, H. Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening. Inf. Sci. 2019, 501, 511–522. [Google Scholar] [CrossRef]
  8. Wei, Q.; Li, X.; Yu, W.; Zhang, X.; Zhang, Y.; Hu, B.; Mo, B.; Gong, D.; Chen, N.; Ding, D.; et al. Learn to segment retinal lesions and beyond. In Proceedings of the International Conference on Pattern Recognition, Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 7403–7410. [Google Scholar]
  9. Li, T.; Bo, W.; Hu, C.; Kang, H.; Liu, H.; Wang, K.; Fu, H. Applications of deep learning in fundus images: A review. Med. Image Anal. 2021, 69, 101971. [Google Scholar] [CrossRef] [PubMed]
  10. Niemeijer, M.; Van Ginneken, B.; Staal, J.; Suttorp-Schulten, M.S.; Abràmoff, M.D. Automatic detection of red lesions in digital color fundus photographs. IEEE Trans. Med. Imaging 2005, 24, 584–592. [Google Scholar] [CrossRef] [PubMed]
  11. Wang, Z.; Yin, Y.; Shi, J.; Fang, W.; Li, H.; Wang, X. Zoom-in-net: Deep mining lesions for diabetic retinopathy detection. In Proceedings of the Medical Image Computing and Computer Assisted Intervention, Quebec City, QC, Canada, 11–13 September 2017; Springer: Cham, Switzerland, 2017; pp. 267–275. [Google Scholar]
  12. Zhou, Y.; He, X.; Huang, L.; Liu, L.; Zhu, F.; Cui, S.; Shao, L. Collaborative learning of semi-supervised segmentation and classification for medical images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2079–2088. [Google Scholar]
  13. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017; pp. 4080–4090. [Google Scholar]
  14. Zhou, T.; Wang, W.; Konukoglu, E.; Van Gool, L. Rethinking Semantic Segmentation: A Prototype View. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2582–2593. [Google Scholar]
  15. Zhang, Z.; Ran, R.; Tian, C.; Zhou, H.; Li, X.; Yang, F.; Jiao, Z. Self-aware and cross-sample prototypical learning for semi-supervised medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; Springer: Cham, Switzerland, 2023; pp. 192–201. [Google Scholar]
  16. Zhou, L.; Zhang, Y.; Zhang, J.; Qian, X.; Gong, C.; Sun, K.; Ding, Z.; Wang, X.; Li, Z.; Liu, Z.; et al. Prototype learning guided hybrid network for breast tumor segmentation in dce-mri. IEEE Trans. Med. Imaging 2024, 44, 244–258. [Google Scholar] [CrossRef] [PubMed]
  17. Liu, Y.; Lin, L.; Wong, K.K.; Tang, X. ProCNS: Progressive Prototype Calibration and Noise Suppression for Weakly-Supervised Medical Image Segmentation. IEEE J. Biomed. Health Inform. 2024, 29, 2845–2858. [Google Scholar] [CrossRef] [PubMed]
  18. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  19. Cheng, D.; Qin, Z.; Jiang, Z.; Zhang, S.; Lao, Q.; Li, K. Sam on medical images: A comprehensive study on three prompt modes. arXiv 2023, arXiv:2305.00035. [Google Scholar] [CrossRef]
  20. Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2024, 15, 654. [Google Scholar] [CrossRef] [PubMed]
  21. Wong, H.E.; Rakic, M.; Guttag, J.; Dalca, A.V. Scribbleprompt: Fast and flexible interactive segmentation for any biomedical image. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 207–229. [Google Scholar]
  22. Wang, H.; Guo, S.; Ye, J.; Deng, Z.; Cheng, J.; Li, T.; Chen, J.; Su, Y.; Huang, Z.; Shen, Y.; et al. Sam-med3d: Towards general-purpose segmentation models for volumetric medical images. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2025; pp. 51–67. [Google Scholar]
  23. He, Y.; Guo, P.; Tang, Y.; Myronenko, A.; Nath, V.; Xu, Z.; Yang, D.; Zhao, C.; Simon, B.; Belue, M.; et al. VISTA3D: A unified segmentation foundation model for 3D medical imaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 20863–20873. [Google Scholar]
  24. Du, Y.; Bai, F.; Huang, T.; Zhao, B. Segvol: Universal and interactive volumetric medical image segmentation. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Volume 37, pp. 110746–110783. [Google Scholar]
  25. Tang, H.; Liu, X.; Sun, S.; Yan, X.; Xie, X. Recurrent mask refinement for few-shot medical image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3918–3928. [Google Scholar]
  26. Irving, B. maskSLIC: Regional superpixel generation with application to local pathology characterisation in medical images. arXiv 2016, arXiv:1606.09518. [Google Scholar]
  27. Zhang, B.; Li, X.; Ye, Y.; Huang, Z.; Zhang, L. Prototype completion with primitive knowledge for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3754–3762. [Google Scholar]
  28. Chen, W.Y.; Liu, Y.C.; Kira, Z.; Wang, Y.C.F.; Huang, J.B. A closer look at few-shot classification. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; pp. 1–12. [Google Scholar]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  30. Yuan, Y.; Chen, X.; Wang, J. Object-contextual representations for semantic segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 173–190. [Google Scholar]
Figure 1. Illustration of promptable retinal lesion segmentation. The arrows in different colors mean different types of lesions.
Figure 2. Framework of our prototype-based promptable retinal lesion segmentation model.
Figure 3. Simulation workflow for generating coarse mask prompts from pixel-level annotations. This process mimics rapid clinical annotation by first identifying lesion areas, clustering them as a clinician might, and then fitting an enclosing ellipse with random scaling to simulate the imprecision of a quick manual marking. This creates realistic training pairs of image patches and their corresponding coarse prompts.
Figure 4. Illustration of superpixel-guided prototype weighting. (a) The actual lesion region (green outline) within the coarse prompt. (b) The prompt region is partitioned into superpixels. (c) A heat map of the prototype weights. The color bar indicates the weight value, where warmer colors (red) represent higher weights assigned to sub-prototypes that are more dissimilar to the background and thus more likely to be true lesion areas.
Figure 5. Visualization of refinement results on different categories of lesions.
Figure 6. Effect of reduction factor r on the number of coarse mask prompts. The ellipses in different colors mean different types of lesions.
Figure 7. Segmentation performance under different reduction factors.
Table 2. Comparisons with other methods on the DDR dataset. For overall comparison, the best results are highlighted in bold, the second-best are underlined, and our results are displayed in gray. All values are IoU (%) ↑.

Methods | MA | SE | EX | HE | Mean
Coarse Mask Prompt | 6.9 ± 3.4 | 32.3 ± 10.8 | 14.8 ± 8.8 | 23.1 ± 11.4 | 19.3 ± 8.6
M2MRF [2] | 12.7 ± 6.2 | 30.8 ± 8.7 | 40.8 ± 9.7 | 31.7 ± 10.3 | 29.0 ± 8.7
HRDecoder [4] | 13.1 ± 6.9 | 29.9 ± 9.3 | 41.6 ± 10.1 | 33.8 ± 8.5 | 29.6 ± 8.7
ResNet18 [29] | 56.3 ± 12.4 | 67.6 ± 13.6 | 52.3 ± 13.2 | 54.4 ± 13.3 | 57.6 ± 13.1
HRNet18 [30] | 65.1 ± 13.5 | 68.2 ± 12.6 | 54.2 ± 13.9 | 58.5 ± 12.9 | 61.5 ± 13.3
U-Net [1] | 60.9 ± 13.2 | 70.2 ± 15.0 | 55.4 ± 12.8 | 59.8 ± 12.7 | 61.6 ± 13.4
MedSAM [20] | 51.2 ± 15.1 | 60.3 ± 19.2 | 43.9 ± 17.3 | 50.1 ± 14.8 | 51.4 ± 16.6
SAM-Med2D [19] | 48.3 ± 12.4 | 62.5 ± 17.6 | 45.7 ± 15.0 | 52.3 ± 15.7 | 52.2 ± 15.2
Ours w/o superpixel | 68.8 ± 12.0 | 71.2 ± 16.9 | 58.4 ± 12.7 | 61.3 ± 15.7 | 64.9 ± 14.3
Ours | 69.6 ± 11.8 | 72.5 ± 17.3 | 59.9 ± 12.5 | 63.2 ± 14.5 | 66.3 ± 14.0
Table 3. Comparisons with other methods on the private dataset. For overall comparison, the best results are highlighted in bold, the second-best are underlined, and our results are displayed in gray. All values are IoU (%) ↑.

Methods | Drusen | Prh | Mean
Coarse Mask Prompt | 33.6 ± 19.1 | 49.7 ± 13.0 | 41.6 ± 16.0
ResNet18 [29] | 43.9 ± 16.6 | 52.9 ± 15.2 | 48.4 ± 15.9
HRNet18 [30] | 44.2 ± 21.1 | 57.5 ± 16.4 | 50.9 ± 18.8
U-Net [1] | 45.2 ± 20.9 | 57.9 ± 16.4 | 51.6 ± 18.6
MedSAM [20] | 40.1 ± 17.2 | 50.7 ± 17.8 | 45.4 ± 17.5
SAM-Med2D [19] | 39.2 ± 20.4 | 53.5 ± 16.2 | 46.4 ± 18.3
Ours w/o superpixel | 47.9 ± 23.4 | 56.3 ± 20.7 | 52.1 ± 22.1
Ours | 49.6 ± 22.0 | 62.7 ± 16.7 | 56.2 ± 19.4
Table 4. Sensitivity analysis of the impact of the maximum number of superpixels $N_{\mathrm{max\text{-}sp}}$. For overall comparison, the best results are highlighted in bold, the second-best are underlined, and our results are displayed in gray. All values are IoU (%) ↑ on the IDRiD dataset.

$N_{\mathrm{max\text{-}sp}}$ | MA | SE | EX | HE | Mean
1 | 84.2 ± 6.2 | 80.7 ± 9.2 | 65.3 ± 8.1 | 69.9 ± 13.9 | 75.0 ± 9.4
10 | 84.3 ± 6.2 | 81.0 ± 9.1 | 65.9 ± 8.5 | 71.8 ± 12.9 | 75.8 ± 9.1
20 | 84.3 ± 6.1 | 81.4 ± 9.6 | 66.7 ± 8.4 | 72.0 ± 11.3 | 76.1 ± 9.4
30 | 84.3 ± 6.1 | 81.5 ± 9.5 | 66.4 ± 8.9 | 72.4 ± 12.6 | 76.2 ± 9.2
Table 5. Segmentation performance with box prompts on different datasets. All values are mean IoU (%) ↑.

Methods | IDRiD | DDR | Private
Coarse Mask Prompt (Ellipse) | 26.8 ± 6.7 | 19.3 ± 8.6 | 41.6 ± 16.0
Coarse Mask Prompt (Box) | 23.4 ± 7.8 | 16.7 ± 9.0 | 36.2 ± 15.8
Ours (Elliptical Prompt) | 76.1 ± 9.4 | 66.3 ± 14.0 | 56.2 ± 19.4
Ours (Box Prompt) | 74.0 ± 10.0 | 64.5 ± 13.8 | 53.3 ± 20.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
