3. Results
In this study, we propose a U-Net-based system for the automatic segmentation of macular holes (MHs) and cysts in OCT images. The experimental setup for the proposed model is detailed below. The hardware and software environments used for the experiments are summarized in
Table 2.
Given our 2.5D input setting (C = 3 stacked adjacent slices) at a 512 × 512 resolution and our hardware constraints (RTX 4050 Laptop GPU with 6 GB VRAM), we adopted a UNet-48 backbone (48→96→192→384) to balance the representational capacity and memory footprint. Wider backbones or full 3D modeling would increase activation memory and force smaller batches or downsampling. Since we trained with small mini-batches (batch size: 3), we used Group Normalization (8 groups), which is more stable than BatchNorm under small-batch regimes.
To mitigate class imbalance, class weights were applied (larger weights for small-volume classes and smaller weights for large-volume classes). Light augmentations were applied to increase data diversity: z-jitter (p = 0.20), channel dropout (p = 0.10), low-level speckle noise (σ = 0.02, p = 0.50), and small-scale affine transforms (rotation ≤ 5°, translation ≤ 3%, scale 0.97–1.03, and shear ≤ 1.5°). Horizontal flip was used with p = 0.50. Training used AdamW (initial learning rate: 1.5 × 10−4; weight decay: 1 × 10−2). The learning rate followed a cosine annealing schedule with an initial ~800-step warm-up. For numerical stability, gradients were clipped at an upper bound of 1.0. Training was planned for a maximum of 24 epochs (mini-batch size: 3). Early stopping was performed on the validation set by monitoring the Dice (hole) metric, with a warm-up of 8 epochs, a patience of 5 epochs, and a minimum improvement threshold of Δ = 0.002; training stopped when validation improvement remained below this threshold for the patience window.
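The warm-up plus cosine-annealing schedule has a simple closed form; the sketch below is illustrative only (the function name and the `total_steps` argument are ours, not taken from the original implementation):

```python
import math

def lr_at(step, total_steps, base_lr=1.5e-4, warmup_steps=800, min_lr=0.0):
    """Linear warm-up for the first `warmup_steps`, then cosine annealing."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps            # linear ramp-up
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

With the settings above, the learning rate ramps linearly to 1.5 × 10−4 over the first ~800 steps and then decays smoothly toward zero for the remaining steps.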
Horizontal flip Test-Time Augmentation (TTA) was applied to each slice to average the probabilities of the original and projected outputs of the model; mixed precision was used for speed/stability when possible.
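As a sketch, flip-TTA reduces to a few lines; here `predict` is a hypothetical callable mapping a `(C, H, W)` slice to per-class probability maps:

```python
import numpy as np

def hflip_tta(predict, image):
    """Average class probabilities over the original and horizontally flipped views."""
    p = predict(image)                        # (K, H, W) probabilities
    p_flip = predict(image[..., ::-1].copy()) # flip along the width axis
    return 0.5 * (p + p_flip[..., ::-1])      # un-flip before averaging
```

The flipped prediction is flipped back before averaging so that both probability maps are in the original image coordinates.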
Lesion predictions were trimmed to anatomical boundaries with an intraretinal constraint. Small noise islands were removed by applying opening + closing for holes and closing for cysts. Neighborhood-slice consistency (±1 slice) was enforced by majority rule, and fragments below size thresholds proportional to the image area were suppressed. Slice and eye-level masks were generated. Performance is reported as macro Dice/IoU. HD95 was calculated for boundary quality, and ECE for calibration. Representative overlays and summary tables were generated.
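The ±1-slice majority rule can be sketched in NumPy as follows (edge padding at the volume ends is our assumption; the morphological cleanup and area-based fragment suppression described above are separate steps):

```python
import numpy as np

def slice_majority(vol):
    """Keep a voxel only if it is foreground in at least 2 of slices {z-1, z, z+1}.

    vol: (Z, H, W) binary mask volume; boundary slices reuse their own
    values via edge padding.
    """
    padded = np.pad(vol.astype(np.int32), ((1, 1), (0, 0), (0, 0)), mode="edge")
    votes = padded[:-2] + padded[1:-1] + padded[2:]   # vote over z-1, z, z+1
    return votes >= 2
```

A blob present in only a single slice receives at most one vote and is removed, while lesions persisting across adjacent slices survive.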
SHAP attributes a signed contribution to each input feature based on game-theoretic Shapley values, providing additive, pixel-level attributions for model predictions [
17]. We used GradientSHAP because it combines SmoothGrad-style noise smoothing with Integrated Gradients’ path integration to approximate Shapley values under input perturbations; compared with Grad-CAM (coarser, layer-level maps) and Vanilla Integrated Gradients (baseline/saturation sensitivity), GradientSHAP provides more stable, fine-grained attributions that align better with segmentation masks and support our quantitative metrics. To manage the computational cost of dense, pixel-level attributions, we performed GradientSHAP offline on a pre-selected subset of lesion-positive target slices (
n = 370 in this study), rather than on all slices. Target slices were chosen from the evaluation split by retaining only those with non-empty hole/cyst masks (in GT and/or prediction, depending on the analysis mode), and attribution metrics were computed only for the relevant class/mode pairs. Inputs were standardized with channel-wise z-scores. We computed GradientSHAP attributions by averaging gradients over 128 samples/16 baselines. Baselines were smoothed using a slight Gaussian blur (
σ = 0.8); positive contributions were normalized, and percentile clipping (0.1–99.9) was applied. To ensure anatomical consistency, the heatmap was displayed only within the retina; attributions outside the retina were suppressed. This suppression was applied only for visualization clarity and did not affect the model outputs. Full-field heatmaps can be generated by disabling the retinal masking when needed. As the reference distribution, we employed Gaussian fuzzy noise, which preserves global intensity statistics while attenuating edges. This avoids artificial contours from zero/black baselines in OCT images and reduces edge-contrast bias, producing smoother, less spurious attributions near anatomical boundaries.
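The positive-part normalization with percentile clipping can be sketched as follows (the function name and argument defaults are ours):

```python
import numpy as np

def normalize_positive(attr, lo_pct=0.1, hi_pct=99.9):
    """Keep positive SHAP contributions, clip extreme percentiles, scale to [0, 1]."""
    pos = np.clip(attr, 0.0, None)
    lo, hi = np.percentile(pos, [lo_pct, hi_pct])
    if hi <= lo:                          # degenerate case, e.g. an all-zero map
        return np.zeros_like(pos)
    return (np.clip(pos, lo, hi) - lo) / (hi - lo)
```

Clipping at the 0.1/99.9 percentiles before rescaling prevents a handful of extreme pixels from compressing the rest of the heatmap's dynamic range.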
Heatmaps were ranked by attribution magnitude and binarized to obtain top-10% (τ = 10%) and top-20% (τ = 20%) masks, allowing evaluation of a narrow-focus (τ = 10%) and a wider-focus (τ = 20%) regime.
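Thresholding at the top-τ fraction of attribution values reduces to a quantile cut, e.g.:

```python
import numpy as np

def top_tau_mask(heatmap, tau=0.10):
    """Binary mask of the top-tau fraction of heatmap values (tau=0.10 -> top 10%)."""
    threshold = np.quantile(heatmap, 1.0 - tau)
    return heatmap >= threshold
```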
All metrics were calculated separately using two reference masks: (i) Ground truth (GT)-referenced mode: SHAP attributions and model output were compared with the GT mask; coverage, leakage, and focus-consistency were measured with respect to GT lesion boundaries. (ii) Prediction (Pred)-referenced mode: The same metrics were calculated based on the model’s own segmentation mask, assessing the self-consistency of the explanations with the model’s decisions. This dual view allowed us to assess the consistency of the explanations with both the clinical labels and the model’s own decisions. The SHAP targets and running parameters are summarized in
Table 3.
The following metrics were used: segmentation quality (Dice, IoU, HD95, and ECE) in the validation and testing phases, and explanation quality (APILτ, ARILτ, Diceτ, Leakτ, COM-dist, and nCOM) in the SHAP analysis phase, each computed in the ground truth (GT) and prediction (Pred) modes. In this study, APILτ, ARILτ, Diceτ, and Leakτ were calculated for τ = 10 and 20; τ = 5 was also used in the factor analysis section. APIL10 denotes APILτ with τ = 10. We performed a sensitivity analysis over τ ∈ {5%, 10%, 20%} (top-τ% attribution thresholds) and report the full τ-sweep results in
Supplementary Tables S2–S4. Our main SHAP-based conclusions were consistent across τ levels.
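Under the definitions above, the overlap-style explanation metrics reduce to simple mask arithmetic; a minimal sketch (function names are ours) is:

```python
import numpy as np

def apil(top_mask, lesion_mask):
    """APIL-tau: fraction of the top-tau attribution mask lying inside the lesion."""
    return np.logical_and(top_mask, lesion_mask).sum() / max(int(top_mask.sum()), 1)

def leak(top_mask, lesion_mask):
    """Leak-tau = 1 - APIL-tau: fraction of the top-tau attribution outside the lesion."""
    return 1.0 - apil(top_mask, lesion_mask)

def com_dist(heatmap, lesion_mask):
    """Euclidean distance (px) between the attribution center of mass and the lesion centroid."""
    ys, xs = np.indices(heatmap.shape)
    total = heatmap.sum()
    com = np.array([(ys * heatmap).sum() / total, (xs * heatmap).sum() / total])
    centroid = np.array(np.nonzero(lesion_mask)).mean(axis=1)
    return float(np.linalg.norm(com - centroid))
```

ARILτ follows the same pattern as APILτ with the retinal mask in place of the lesion mask.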
Present-only metrics were computed only on slices that contain a lesion. For each class, Dice and IoU were computed per slice and then averaged across slices (macro per-slice). Consequently, the averaged Dice and IoU are not expected to satisfy the closed-form conversion IoU = Dice/(2 − Dice). Unless otherwise stated, metrics included absent-class slices, which contribute a neutral value due to smoothing; ‘present-only’ averaging was used only in the analyses explicitly marked using this method. All explanation metrics are reported separately for the GT (clinical label alignment) and Pred (model self-consistency) modes. The scope and usage of parameters are summarized in
Table 4.
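A two-slice toy example shows why slice-averaged Dice and IoU no longer satisfy the per-slice identity IoU = Dice/(2 − Dice):

```python
import numpy as np

def dice_iou(pred, gt, eps=1e-6):
    """Per-slice Dice and IoU with a small smoothing term."""
    inter = np.logical_and(pred, gt).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, gt).sum() + eps)
    return dice, iou

# slice 1: perfect overlap; slice 2: partial overlap
perfect = np.array([1, 1, 0, 0], dtype=bool)
d1, i1 = dice_iou(perfect, perfect)                    # Dice = IoU = 1
d2, i2 = dice_iou(np.array([1, 1, 0, 0], dtype=bool),
                  np.array([0, 1, 1, 0], dtype=bool))  # Dice = 0.5, IoU = 1/3
mean_dice, mean_iou = (d1 + d2) / 2, (i1 + i2) / 2
```

Here mean IoU ≈ 0.667, while mean Dice/(2 − mean Dice) = 0.6: the identity holds per slice but is broken by averaging, because the conversion is nonlinear.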
To make SHAP visualizations more interpretable, we applied factor analysis. In addition to the explanation metrics, we defined Leak20 (=1 − APIL20, the proportion of attribution outside the lesion), Lesion Pixels (LP) (lesion area), Positive Attribution Mass (PAM) (total positive SHAP contribution), and Focus Concentration (FC = APIL5/APIL20). To control for size effects, we used the normalized center-of-mass distance (nCOM) and its sign-reversed form, Focus Proximity (FP = −nCOM); a higher FP indicates that attributions are concentrated closer to the lesion.
All features were scaled with StandardScaler, followed by a three-component principal component analysis (PCA). We then extracted three clusters with k-means in the PC1–PC2 space. The outputs include loading heatmaps (PC1–PC3), a PC1–PC2 scatterplot, and cluster-wise metric means. The component interpretations were as follows: (i) PC1 captures the success–leakage contrast, which increases when APIL and Dice load positively and Leak20 loads negatively, corresponding to lesion-focused, low peri-lesion patterns; (ii) PC2 reflects retinal containment and spatial focusing, which rises with ARIL and FP, indicating contributions concentrated within the retina near the lesion; (iii) PC3 emphasizes size-independent focusing, which is driven mainly by FP.
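The standardize → PCA(3) → k-means(3) pipeline can be sketched in plain NumPy (a stand-in for the scikit-learn StandardScaler/PCA/KMeans stack; the deterministic farthest-point initialization is our simplification):

```python
import numpy as np

def standardize(X):
    """Column-wise z-score (the StandardScaler step)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca_scores(X, n_components=3):
    """Project centered data onto its leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T, Vt[:n_components]   # scores, loadings

def kmeans(X, k=3, iters=100):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.asarray(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```

In our pipeline the cluster assignment is computed on the first two component scores, i.e., in the PC1–PC2 plane.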
k-means clustering revealed three regimes in the PC1–PC2 plane: (i) a retina-dominant regime—where the focus is mainly within the retina with limited lesion overlap (low APIL/Dice, and high Leak20); (ii) a peri-lesional regime—where the focus extends around the lesion (low APIL and Dice are relatively higher; Leak20 is relatively lower but still noticeable); and (iii) a narrow-coverage regime—where APIL and Dice are near zero, Leak is near one, and the intraretinal coverage is low. Cluster names are data-driven and chosen for visual interpretation; each represents a deviation from an ideal intra-lesion focus. This holistic view combines multiple metrics in a low-dimensional space to jointly examine focus–coverage–leakage dynamics. Loadings summarize metric–component relationships, while scatter/cluster plots show how the samples are distributed across regimes. For clarity, Algorithm 1 summarizes the overall end-to-end workflow of the proposed segmentation and regime-based quantitative XAI pipeline.
| Algorithm 1. Overview of the proposed pipeline |
| 1. Split patients/eyes into train/val/test (70/15/15, leakage-safe). |
| 2. Pack inputs as 2.5D (C = 3) or 2D (C = 1). |
| 3. Train the 2.5D UNet-48 (GroupNorm) segmentation model on the training split. |
| 4. Evaluate on test set and report Dice/IoU (and HD95/ECE). |
| 5. Select lesion-positive target slices for XAI analysis (GT- and Pred-referenced). |
| 6. Compute GradientSHAP attribution maps per class (offline). |
| 7. Threshold at top-τ% (τ ∈ {5, 10, 20}) and compute APILτ, ARILτ, Diceτ, Leakτ, and COM-dist. |
| 8. Assign samples to regimes (retina-dominant/peri-lesion/narrow-coverage) and summarize results. |
SHAP summaries were cross-tabulated with image-quality flags (signal shield, image blur, and low signal strength) and all analyses followed the present-only rule. Unless otherwise noted, comparisons used τ = 20 (Dice20, APIL20, and Leak20). Cyst size was stratified by area quartiles; small cysts were defined as the lower quartile (Q1) with a threshold of 518 pixels and compared with medium/large cysts; Cohen’s d was reported where applicable. For image-quality flags (signal shield and image blur), we report Δ = flagged − clean differences. The best/worst slices were selected by macro-Dice. For each sample, we show the mask, Top-10 Attribution–Lesion Overlay (T10 overlay), and SHAP heatmap together. Finally, success quartiles Q1–Q4 were derived from present-only Dice20, reporting the per-quartile metric means and counts of hole-present/cyst-present slices.
Figure 3 presents the training and validation curves and the per-class Dice evolution across epochs.
While the UNet-48/GN (512 × 512) model achieved high Dice/IoU values for the retina, hole, and choroid classes on the test set, the cyst scores were lower. The lesion-class HD95 mean was 6.0 px, indicating small boundary errors. The ECE was 0.008 (computed on hole/cyst-only slices), suggesting good calibration (
Table 5). To contextualize these results, we evaluated classic U-Net baselines (UNet-64 + BatchNorm) at full resolution (512 × 512) on the same test split in both 2D (
C = 1) and 2.5D (
C = 3) settings, alongside our proposed 2.5D UNet-48 + GroupNorm. The proposed model achieved a Dice/IoU of 0.947/0.915 (mean, BG included), with a macular hole Dice/IoU of 0.941/0.910 and cyst Dice/IoU of 0.874/0.805. In comparison, the 2D U-Net baseline reached a Dice/IoU of 0.920/0.874, with a macular hole Dice/IoU of 0.917/0.883 and cyst Dice/IoU of 0.843/0.771, while the 2.5D U-Net-64 baseline obtained an overall Dice/IoU of 0.877/0.836 and macular hole Dice/IoU of 0.709/0.708. These results support our choice of UNet-48 as a more memory-efficient backbone for 2.5D segmentation under limited GPU resources and Group Normalization as a stable normalization choice for small mini-batches (see
Supplementary Table S1).
The macro scores in Pred mode were below those in GT mode, indicating that the attributions aligned better with the clinical labels (GT) and overlapped the model’s own segmentations more conservatively. In terms of spatial accuracy, COM-dist was lower in Pred mode (8.98 px vs. 10.24 px), meaning the SHAP center of mass lay closer to the lesion center in Pred than in GT, suggesting a stronger spatial fit (
Table 6). As τ increased from 10% to 20%, the top-τ area increased, while Dice and APIL decreased and Leak increased; COM-dist remained lower in Pred mode (
Figure 4).
For holes, the peri-lesion (
n = 79) cluster in GT mode showed the highest intraretinal coverage (ARIL20 = 0.684) with relatively lower leakage (Leak20 = 0.917). Meanwhile, the COM-dist of 16.22 px and FP of −0.21 indicate a shift in focus to the peri-lesion region; the large area (LP = 6487.4 px) supports this pattern. In the retina-dominant cluster (
n = 77), ARIL20 = 0.659, but APIL20/Dice20 (0.017/0.032) were low and Leak20 was high (0.983); however, the COM-dist of 7.84 px and FP of −0.24 indicate a closer focus. The LP of 1282.4 px indicates a smaller field, and the FC of 3.99 indicates a tight focus. Two main improvements were evident in Pred mode: (i) ARIL20 increased in both main clusters, and (ii) spatial proximity improved. Furthermore, a narrow-coverage cluster (
n = 15) appeared, representing difficult slices with small/partial holes where coverage failed. The cluster-level FP, LP, PAM, and FC indicators add an operational layer to clinical interpretation by combining focal proximity, lesion size, total positive contribution, and focal tightness (
Table 7).
For cysts, in GT mode, the peri-lesion (
n = 113) cluster reflected a peripheral focus in large lesions (LP = 4386.4 px) with an ARIL20 of 0.719, Leak20 of 0.940, COM-dist of 10.17 px, and FP of −0.16. The retina-dominant (
n = 253) cluster showed a relatively close but non-specific focus near the lesion, with an ARIL20 of 0.654, very low APIL20/Dice20 (0.012/0.024), Leak20 of 0.988, COM-dist of 6.91 px, and FP of −0.49. The narrow-coverage cluster (
n = 4) was very small (LP = 7.0 px) and represented coverage failure (Leak20 = 1.00, ARIL20 = 0.494, FP = −22.59, and COM-dist = 52.02 px). In Pred mode, ARIL20 increased significantly in all main clusters, and spatial proximity improved; partial recovery was also observed in the narrow-coverage cases. Overall, LP was largest in PL clusters, while FP/COM-dist (focal proximity), PAM (positive contribution), and FC (focal tightness) provide complementary cues for clinical interpretation (
Table 8).
The narrow-coverage regime was markedly the least prevalent in both modes (hole: 0% in GT and 9.3% in Pred; cyst: 1.1% in GT and 3.9% in Pred). In cysts, the retina-dominant cluster was approximately twice as common as the peri-lesional cluster (GT: 68.4% vs. 30.5%; Pred: 65.0% vs. 31.1%), while for holes, the two patterns were similarly frequent (GT: 49.4% vs. 50.6%; Pred: 46.6% vs. 44.1%). The higher narrow-coverage rate in Pred than in GT (hole: 0% vs. 9.3%; cyst: 1.1% vs. 3.9%) suggests that the model produced narrow-coverage foci more often when explaining its own segmentations. These cases may signal false-positive bias or partial/incomplete coverage and warrant extra validation. By contrast, the retina-dominant/peri-lesion distributions were closely matched across modes, indicating that the primary pattern mix was preserved; the difference stemmed mainly from the rise in the narrow-coverage cluster.
Three components emerged from the factor analysis: PC1 represents the contrast between focus/coverage and leakage (APIL and Dice positive, Leak negative); PC2 captures intraretinal localization and spatial proximity (ARIL and FP positive); and PC3 reflects pure focus without size effects (FP dominant). The GT and Pred patterns were similar. Loadings were more pronounced in Pred, particularly for PC3, suggesting a relatively stronger representation of the focus component in the model’s explanations (
Figure 5).
On the PC1–PC2 plane, the samples split into three regimes: retina-dominant (high ARIL, limited APIL/Dice, high Leak, and generally low–moderate FP), peri-lesion (high Leak and relatively better APIL/Dice), and narrow-coverage (APIL/Dice ≈ 0, low ARIL, Leak ≈ 1, and very negative FP). In Pred, the clusters separated more clearly, indicating stronger self-consistency of the explanations with the model’s segmentation. Clinically, the peri-lesion regime can shift heat toward bright band/layer edges; the narrow-coverage regime often reflects incomplete coverage in small or fragmented cysts; and the retina-dominant regime concentrates within the retina with limited lesion overlap (
Figure 6).
The SHAP maps were divided into three patterns: (i) retina-dominant patterns where the heat primarily concentrates within the retina near the lesion. The spatial focus is consistent but the lesion specificity is limited (low APIL/Dice and high Leak). ARIL is high (though lower than in peri-lesion patterns), and COM-dist is generally short. This regime is typically driven by intra-retinal reflectivity—non-specific reflectance and layer-texture inside the retina—so high-contrast yet non-diagnostic cues attract attributions to nearby retinal tissue without precisely overlapping the lesion. (ii) In peri-lesion patterns, heat concentrates around the lesion perimeter and along layer edges, and partly penetrates into the lesion. As a result, ARIL is highest, APIL/Dice are higher, and Leak is lower than in retinal-dominant patterns, while COM-dist is longer (the ring shifts the focus outward). This regime is typically driven by lesion-border hyperreflectivity—sharp, reflective rims (bright bands/specular reflections) and boundary artifacts—which pull attributions to the rim and can be amplified by mild over-segmentation across the boundary. (iii) In narrow-coverage patterns, heat captures only part of the lesion, particularly in small/fragmented cysts or thick retinal sections. Here, APIL/Dice are very low (≈0), ARIL is low, Leak ≈ 1, and COM-dist is variable, indicating incomplete coverage. Accordingly, a retinal-dominant pattern can be characterized as “retina-focused but non-specific,” a peri-lesion pattern can be characterized as having “marginal/artifact risk,” and narrow-coverage patterns have “small-lesion sensitivity” (
Figure 7).
In GT mode, peri-lesion cases showed the strongest overlap scores (e.g., hole Dice/IoU = 0.842/0.737, cyst Dice/IoU = 0.840/0.747), while retina-dominant cases were slightly lower but comparable (hole: 0.831/0.736; cyst: 0.805/0.714). In cysts in GT mode, narrow-coverage corresponds to reduced overlap (0.776/0.647, with a small sample count), consistent with difficult small/fragmented lesions. In Pred mode, we observe the same overall tendency that peri-lesion remains strong (hole 0.847/0.744, cyst 0.842/0.749), supporting the use of narrow-coverage and/or attribution drift as a practical ‘review-needed’ flag. We summarize the segmentation quality (Dice/IoU) stratified by SHAP regime in
Supplementary Tables S5 and S6 to support the intended use of regimes as reliability/warning indicators.
Figure 8 shows the typical focus patterns of the retinal-dominant (RD), peri-lesion (PL), and narrow-coverage (NC) profiles in real slices. Focus centering, leakage, and coverage patterns varied across clusters, consistent with the APIL, Dice, Leak, and COM-dist values.
In the best–worst eye comparison, the Dice score for holes was stable with an approximately 3% difference between the best and worst eyes, whereas cyst Dice decreased by approximately 13%. The performance loss was therefore primarily driven by cysts. In small cysts (n = 252), Dice = 0.784 ± 0.202; in medium/large cysts (n = 118), it was 0.906 ± 0.036, yielding Δ = −0.122 (Cohen’s d = −0.73).
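For reference, the pooled-standard-deviation form of Cohen’s d used for this comparison is:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation for two independent groups."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)
```

A negative d, as reported here, reflects that the first group (small cysts) scores lower than the second (medium/large cysts).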
ECE was low for both classes (hole: 0.007; cyst: 0.010). Across performance quartiles, the lowest-performing group contained few hole-present slices (n = 7) but many cyst-present slices (n = 90); in the highest-performing group, hole-present slices increased (n = 53) while cyst-present slices remained similar (n = 89). This suggests that low-performing samples were cyst-dominant, harder cases, whereas high-performing samples showed more stable overall segmentation.
For slices flagged with signal shield (
n = 47), we observed a Dice20 Δ = +0.018, APIL20 Δ = +0.010, and Leak20 Δ = −0.010; for image blur (
n = 12), Dice20 Δ = −0.0056, APIL20 Δ = −0.0034, and Leak20 Δ = +0.0034. Signal shield appears to direct attention into the lesion (higher focus/coverage, lower leakage), whereas image blur introduces edge ambiguity (lower focus/coverage, higher leakage). Δ is not reported for low signal strength because there were no positive examples (
Figure 9).
In the best examples, the cysts are large and well-defined, segmentation was high, and annotation was focused within the lesion. In the worst examples, small/fragmented cysts and focal distortion due to layer-boundary artifacts were predominant. This contrast is consistent with the size effect and trends in the quadrant analysis (
Figure 10).
4. Discussion
In our study, we achieved a Dice/IoU of 0.941/0.910 for macular hole (MH) segmentation and Dice/IoU of 0.874/0.805 for cyst segmentation on the OIMHS dataset. Using the same dataset, Herath et al. reported the best MH performance with InceptionNetV4 + U-Net (Dice = 0.9672) and emphasized that HD95 may be unreliable for small structures, highlighting the importance of overlap-based metrics for MH evaluation [
9]. Kulyabin et al. also used the same OIMHS benchmark and reported overall Dice scores of 0.913 (MH) and 0.902 (IRC) using prompt-driven volumetric segmentation with SAM 2/MedSAM 2, highlighting strong performance under a different interaction/evaluation setting than our fully automatic pipeline [
18]. Although the OIMHS dataset is a public benchmark, direct numerical comparisons across studies should be interpreted cautiously because train/validation/test splits, preprocessing choices, and interaction/evaluation settings may differ. For reproducibility, we used a stage-balanced, patient-level 70/15/15 split (keeping all slices from the same patient/eye in a single subset) and report results on the held-out test set. Using a different dataset, Frawley et al. demonstrated that volumetric 3D U-Net variants can achieve strong MH overlap (mean IoU = 0.876 ± 0.012, with reported example Dice values in the 0.94–0.97 range) [
19]. Automated 3D segmentation has also been applied to MH morphology and measurements, reporting high validation accuracy (99.19%) and clinically meaningful variability in anatomical measurements [
20]. Similarly, Pereira et al. showed that automated MH volume estimation can closely match manual grading (R² = 0.94) and may correlate with postoperative visual recovery better than the minimum linear diameter [
21]. For intraretinal cyst segmentation, prior work reports lower overlap scores than for MH segmentation, with Dice = 0.71 in Girish et al. [
22], mean Dice = 0.78 (OPTIMA)/0.81 (KERMANY) in Ganjee et al. [
23], and Dice = 0.69 (OCSC)/0.67 (DME)/0.79 (AEI) in Gopinath and Sivaswamy [
24], underscoring the higher difficulty of cyst delineation. Overall,
Table 9 places our results within the context of recent literature and suggests that our 2.5D UNet-48 design provides competitive segmentation accuracy while remaining computationally practical.
The 2.5D U-Net offers a practical balance between the context-free nature of 2D approaches and the memory/labeling costs of 3D approaches: lower hardware requirements and shorter training times [
19,
25,
26], consistent segmentation with multiple scan planes and strong data augmentation [
27], and reduced boundary ambiguity by incorporating neighboring-slice context [
28,
29,
30]. The computational efficiency and accuracy trade-offs of lightweight (MobileNetV2 + U-Net) and heavyweight (InceptionV4 + U-Net) backbones have been demonstrated using the OIMHS dataset [
9]. Recent OCT studies also report strong results with alternative paradigms. Kulyabin et al. demonstrate that prompt-driven foundation models can achieve high volumetric biomarker segmentation performance on the OIMHS dataset under an interaction-based evaluation setup, which is not directly comparable to fully automatic 2D/2.5D/3D baselines [
18]. In contrast, Toğaçar et al. proposed a CNN-activation–based retinal disease detection approach and reported overall accuracies of 99.60%, 99.89%, and 97.49% across three OCT datasets, highlighting the broader interest in activation/attribution-driven interpretability beyond segmentation [
31]. Our aim was not only to achieve high accuracy but also to leverage the advantages of the 2.5D approach to transform saliency explanations into regime-based confidence signals and integrate them into clinical practice. To this end, we quantified the explanations using SHAP coverage metrics and factor analysis.
Accurate, qualitative, and quantitative identification of macular hole morphology is critical for surgical planning and prognosis. Parameters such as hole dimensions, base and minimum linear diameter (MLD), hole depth, and volume have been shown to be closely related to visual prognosis during preoperative evaluation [
3,
32]. Furthermore, hole chronicity is a determinant of surgical success and postoperative visual improvement [
33]. Accurate segmentation of intraretinal cysts is also crucial for monitoring retinal fluid distribution and assessing treatment response [
34,
35]. By generating automatic and reproducible segmentation for MHs and cysts, this study lays a solid foundation for the future derivation of reliable and comparable measurements such as MLD, basal diameter (BD), depth, and volume. Furthermore, the degree to which automated measurements can be interpreted under various conditions is also important; this is where SHAP analysis, which converts explanations into regime-based quantitative signals, comes into play.
In our study, we applied SHAP analysis to understand the model’s decision-making processes and improve its clinical interpretability. SHAP is used in medical image analysis to improve the understandability of black-box models [
17,
36]. In the OCT literature, SHAP has been used primarily to interpret feature- and layer-thickness-level explanations and classification models [
37]. Explainability in the context of macular holes has been mostly reported with Grad-CAM visualizations [
10,
38]. Such saliency maps need not remain purely visual; they can be quantified against clinical segmentations and evaluated with standard metrics. Quantitatively comparing the explanation with the clinical mask, rather than inspecting the overlay images alone, provides a more robust assessment against biases inherent to saliency methods (e.g., edge/contrast sensitivity) [
39]. Saporta et al. standardized the quantitative comparison of saliency heatmaps with clinical segmentation masks on chest radiographs and systematically evaluated multiple saliency methods and architectures with metrics such as IoU and pointing-game [
40]. In this study, we not only presented SHAP overlay images but also numerically evaluated the overlap of the overlay with the predicted and true masks using APIL, Dice, and Leak metrics. Furthermore, we classified these focal patterns into retinal-dominant, peri-lesion, and narrow-coverage regimes using PCA and k-means-based factor analysis, making the model’s explanations interpretable not only visually but also numerically and categorically. This approach allows for the direct identification of error sources (such as small cyst size, artifacts, and narrow-coverage) associated with the explanations, enabling clinical prediction of cases in which the model can be trusted.
For clinical use, the three explanation patterns can be interpreted as a quality-control/warning layer for the model. In the retina-dominant profile, the output is often reassuring because the focus is primarily within the retina and close to the lesion; however, the risk of false positives should be reviewed due to the possibility of a non-specific focus, especially along bright bands and border reflections. The peri-lesion profile indicates a focus pattern that often follows the lesion vicinity and may still coincide with acceptable segmentation quality. In this case, checking for artifacts, segment boundaries, and possible over-segmentation is recommended, and re-acquisition or manual correction should be performed if necessary. The narrow-coverage profile is most often seen in small/fragmented cysts and thick retinal slices. Here, it is appropriate to examine adjacent slices, evaluate the volume in three dimensions, and adjust the threshold/filter settings if necessary. While segmentation masks are spatial outputs, they do not explain why a prediction was produced nor whether it is trustworthy. Therefore, our SHAP analysis is not used to rediscover lesion location but to derive quantitative, case-level reliability signals by measuring attribution–mask alignment and organizing explanations into operational regimes (retina-dominant, peri-lesion, and narrow-coverage). In clinical deployment, where ground-truth masks are unavailable, the Pred-referenced regime can serve as a practical self-consistency indicator: peri-lesion and retina-dominant patterns generally coincide with stable Dice/IoU behavior, whereas the narrow-coverage pattern and/or large attribution drift (high COM-dist) flags cases that warrant additional review.
Notably, in the retina-dominant regime, the attribution mass is concentrated within retinal tissue and remains spatially close to the lesion region (low COM-dist), but it is not necessarily confined to the lesion mask (often accompanied by high leakage), which motivates treating the regimes as warning/triage signals rather than as guarantees of correctness. Recent Shapley-based studies similarly emphasized that pixel-level heatmaps often require additional interpretation (e.g., grouping analysis) to become actionable [
41]. Ren et al. introduced a contrast-level Shapley framework to assess how MRI contrasts contribute to segmentation decisions [
41]. In contrast, our objective is not to explain input modalities, but to operationalize pixel-level GradientSHAP into a deployment-oriented reliability layer for identifying potentially unreliable segmentations when clinical ground truth is unavailable.
For both holes and cysts, the retina-dominant cluster was the most common and the narrow-coverage cluster the least common. In holes, central slices usually contained a larger lesion area, so narrow-coverage patterns were less frequent and the segmentation boundaries appeared more reliable than in cysts. In cysts, the narrow-coverage cluster was likewise the least common, but segmentation reliability was relatively lower because this pattern occurred even in central slices. However, in current clinical practice, precise margin-related measurements of cysts are not routinely performed; the primary clinical need is cyst detection and monitoring, so the practical impact is limited.
The correspondence of model foci in macular holes to clinically significant regions is consistent with explainability examples in the literature. The hole edges are frequently highlighted in heatmaps generated by Grad-CAM-like methods. Indeed, in a multicenter closure prediction study by Hu et al., it was shown that the focus was concentrated on the hole and surrounding retina [
10]. Similarly, Mariotti et al.’s AI-based quantitative biomarker analysis correlated improvement in ELM/EZ integrity with visual gain, while the images highlighted areas adjacent to the macular hole margin [
38]. Because ELM/EZ continuity at the photoreceptor-band level is closely linked to visual prognosis, interruptions in these layers become apparent in the attribution maps [
38]. In our study, the SHAP overlays also highlighted exposed retinal pigment epithelium (RPE) regions at the base of the MHs, outer plexiform layer borders, and cyst walls. This finding suggests that the model uses regions of transition from hyperreflective to hyporeflective areas as discriminatory cues in OCT decisions; in other words, focus tends to be concentrated at clinically significant borders and interfaces.
This study has several limitations. First, the evaluation was performed with a single data source, and external validation was not performed; the generalizability of the findings is therefore limited. We focused on the publicly available OIMHS dataset to ensure full reproducibility (labels, preprocessing, and evaluation) and because annotated OCT volumes for macular hole and cyst segmentation are scarce in an open, multi-center form; external validation requires access to additional labeled cohorts, which is outside the scope of this initial benchmark study. Second, the 2.5D architecture may be more susceptible to uncertainty, particularly in small/fragmented cysts, as it cannot capture full 3D continuity. Third, SHAP-based quantification is sensitive to the selection of τ thresholds and preprocessing steps, and center-of-mass distance-based measurements require pixel-based calibration. Fourth, label subjectivity in manual masks can also influence the results. Fifth, the lack of comprehensive, multicenter comparisons with current 3D/hybrid architectures leaves generalizability and comparability uncertain. Finally, we did not conduct a clinical user study or expert-centered evaluation; therefore, the clinical utility of the proposed explanations has not yet been validated. As future work, we will validate the model across centers/devices and run ablation studies comparing 2.5D with 2D/3D/hybrid architectures. We will quantify the domain shift across devices and centers by reporting performance and calibration per site/scanner and assess robustness via simple adaptation strategies. We will also analyze the sensitivity of SHAP to the baseline choice and τ levels, and assess deployment-oriented improvements such as uncertainty estimation and clinician-in-the-loop verification.
In addition, we will perform an expert-in-the-loop evaluation in which ophthalmology specialists assess the segmentation quality and explanation usefulness or trustworthiness using representative cases.