3.3.1. Main Image-Evaluation Form
The main evaluation form included 21 complete responses, equally distributed among retina specialists (n = 7), general ophthalmologists or other non-retina subspecialists (n = 7), and ophthalmology residents (n = 7). A subset of four participants (1 retina specialist, 1 general ophthalmologist, and 2 residents) additionally completed the supplementary form with extended image sets. No missing data were observed.
Figure 3 shows ten randomly selected panels that were presented to the respondents.
In the main form, which contained 10 image sets and therefore 630 image-level technique ratings (10 sets × 21 respondents × 3 techniques), TRAST achieved the highest ratings across all summary metrics. Its mean score was 2.49, with a median of 3 and an interquartile range (IQR) of 2 to 3, and 61.0% of its ratings fell in the favorable 3 to 4 range. Grad-CAM++ was intermediate, with a mean of 1.67, a median of 2, an IQR of 1 to 2, and 24.3% favorable ratings. CGFM was lowest, with a mean of 1.41, a median of 1, an IQR of 0 to 2, and 20.5% favorable ratings. These results are summarized in Table 5.
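The summary metrics reported above (mean, median, IQR, and share of favorable ratings) can be computed from raw 0 to 4 Likert ratings as in the following minimal sketch. The `summarize` helper and the example ratings are hypothetical illustrations and are not the study data. The quartile convention (`method="inclusive"`) is also an assumption, since the text does not state which one was used.

```python
from statistics import mean, median, quantiles

N_SETS, N_RESPONDENTS, N_TECHNIQUES = 10, 21, 3
# Rating count stated in the text: 10 sets x 21 respondents x 3 techniques.
TOTAL_RATINGS = N_SETS * N_RESPONDENTS * N_TECHNIQUES  # 630


def summarize(ratings, favorable_min=3):
    """Summarize 0-4 Likert ratings for one technique (hypothetical helper)."""
    # Assumed quartile convention; other conventions can shift Q1/Q3 slightly.
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    # Share of ratings in the favorable 3-to-4 range.
    favorable = sum(r >= favorable_min for r in ratings) / len(ratings)
    return {
        "mean": mean(ratings),
        "median": median(ratings),
        "iqr": (q1, q3),
        "favorable_share": favorable,
    }


# Hypothetical ratings for illustration only (not taken from the study).
example = [0, 1, 2, 2, 3, 3, 3, 4]
print(summarize(example))
```

Applied per technique to the 210 main-form ratings (10 sets × 21 respondents), a function of this kind would produce the type of values listed in Table 5.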
At the respondent-summary level, the three techniques differed significantly overall (Friedman test, chi-square = 14.34, p = 0.00077), with Kendall’s W = 0.34, which corresponds to a moderate repeated-measures effect. Post hoc Wilcoxon tests with Holm correction showed that TRAST was superior to CGFM (adjusted p = 0.0006) and to Grad-CAM++ (adjusted p = 0.010). The comparison of CGFM versus Grad-CAM++ was only borderline (p = 0.051), so the separation in the main form lies mainly between TRAST and the other two techniques rather than between CGFM and Grad-CAM++.
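The reported effect size and p-value can be cross-checked from the Friedman statistic alone. The sketch below assumes the p-value comes from the usual asymptotic chi-square approximation with k − 1 degrees of freedom (for three techniques, df = 2, where the survival function reduces to exp(−x/2)) and that Kendall's W is obtained as chi-square / (n(k − 1)). The function names are illustrative and not taken from the study's analysis code.

```python
import math


def friedman_p_value_three_conditions(chi_square: float) -> float:
    """Asymptotic p-value for k = 3 conditions (chi-square with df = 2).

    For df = 2 the chi-square survival function is exactly exp(-x / 2).
    """
    return math.exp(-chi_square / 2.0)


def kendalls_w(chi_square: float, n_raters: int, k_conditions: int) -> float:
    """Kendall's W from the Friedman statistic: W = chi2 / (n * (k - 1))."""
    return chi_square / (n_raters * (k_conditions - 1))


# Main form: 21 respondents, 3 techniques, chi-square = 14.34 (from the text).
w_main = kendalls_w(14.34, n_raters=21, k_conditions=3)
p_main = friedman_p_value_three_conditions(14.34)
print(f"W = {w_main:.2f}, p = {p_main:.5f}")
```

Under these assumptions, the computed values agree with the reported W = 0.34 and p = 0.00077 to the stated precision.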
The image-by-image pattern in the main form was consistent across cases, with TRAST achieving the highest or joint-highest ratings in the majority of image sets. The detailed image-level results are summarized in Table 6.
The overall statistical pattern is also reflected in Figure 4. Figure 4a shows the clear separation in mean Likert score between the three methods, with TRAST highest, Grad-CAM++ intermediate, and CGFM lowest. These ranks are preserved across the three professional groups (Figure 4b), although retina specialists appear more favorable to Grad-CAM++ than the other groups. Figure 4c further supports the same conclusion by showing that TRAST accumulated a much larger share of ratings in the higher part of the scale, whereas CGFM and Grad-CAM++ had a greater concentration of low ratings.
3.3.2. Supplementary Image-Evaluation Form
The supplementary form added 20 image sets (sets 11 to 30) and 240 ratings (20 sets × 4 respondents × 3 techniques). The results showed a consistent descriptive trend favoring TRAST across most image sets. Across all supplementary images, TRAST had a mean of 2.33, a median of 2, and an IQR of 2 to 3, with 36.3% favorable ratings of 3 to 4. Grad-CAM++ had a mean of 1.53 and a median of 1, with 16.3% favorable ratings, while CGFM had a mean of 1.20 and a median of 1, with 13.8% favorable ratings. A respondent-level Friedman test still reached significance (chi-square = 6.50, p = 0.039), with Kendall’s W = 0.81, but this very large effect size should not be overinterpreted because it is based on only four raters. Pairwise Wilcoxon tests were not significant. The supplementary analysis is therefore treated as supportive evidence for the descriptive pattern observed in the main form rather than as a standalone confirmatory result.
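One reason the pairwise tests cannot reach significance with four raters can be illustrated by enumeration. The sketch below assumes an exact, two-sided Wilcoxon signed-rank test with no zero or tied differences. The text does not state which variant was used, so this is an illustration of the sample-size limit rather than a reproduction of the reported analysis. The function name is hypothetical.

```python
from itertools import product


def min_two_sided_wilcoxon_p(n: int) -> float:
    """Smallest exact two-sided p-value of the Wilcoxon signed-rank test
    for n non-zero, untied paired differences."""
    ranks = range(1, n + 1)
    # Under the null, each of the 2**n sign assignments is equally likely.
    w_plus = [
        sum(r for r, positive in zip(ranks, signs) if positive)
        for signs in product((False, True), repeat=n)
    ]
    max_w = n * (n + 1) // 2
    # Most extreme outcomes: all differences share one sign (W+ = 0 or max).
    extreme = sum(1 for w in w_plus if w == 0 or w == max_w)
    return extreme / len(w_plus)


print(min_two_sided_wilcoxon_p(4))
```

With n = 4, the smallest attainable two-sided p-value under these assumptions is 2/16 = 0.125, so no pairwise comparison can fall below 0.05 even when all four raters agree on the direction of the difference.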
The CNV supplementary block, sets 11 to 14, was consistently favorable to TRAST. Set 11 showed TRAST at 2.00 while both CGFM and Grad-CAM++ were only 0.75, narrowly missing significance at p = 0.058. Set 12 showed TRAST at 2.50, Grad-CAM++ at 1.75, and CGFM at 0.50, p = 0.030. Set 13 again favored TRAST at 2.25 over CGFM at 1.25 and Grad-CAM++ at 0.50, p = 0.037. Set 14 repeated the same pattern, with TRAST at 2.00 and the other two both below 1, p = 0.023. The extra CNV images strengthen the evidence that TRAST is more convincing than the alternatives for CNV localization.
The CSR supplementary block, sets 15 to 18, also leaned toward TRAST, though with more variability. Set 15 showed TRAST at 2.50, but the difference was not significant, p = 0.424. Set 16 was stronger, with TRAST at 2.75, CGFM at 2.25, and Grad-CAM++ at 1.50, p = 0.037. Set 17 still favored TRAST at 2.00 over 1.75 for CGFM and 0.75 for Grad-CAM++, p = 0.039. Set 18 gave the strongest CSR result in the supplementary block, with TRAST at 3.00, CGFM at 2.25, and Grad-CAM++ at 0.75, although the p-value remained 0.092 because of the small sample. In practical terms, the CSR extra images suggest that TRAST is again the most stable option, while CGFM can sometimes perform reasonably well, and Grad-CAM++ is the least consistent.
The DME supplementary block, sets 19 to 22, is the one area where Grad-CAM++ most clearly challenges TRAST. Set 19 favored TRAST at 2.50 over Grad-CAM++ at 1.75 and CGFM at 1.00, p = 0.092. Set 20 was one of the few images where Grad-CAM++ came first, with a mean of 2.25 versus 2.00 for TRAST and 0.50 for CGFM, p = 0.023. Set 21 repeated that pattern, with Grad-CAM++ at 2.50, TRAST at 2.00, and CGFM at 0.50, p = 0.061. Set 22 returned the advantage to TRAST at 2.50, while CGFM and Grad-CAM++ were both only 0.75, p = 0.085. So, for these extra DME images, TRAST is still competitive, but Grad-CAM++ appears much more image-sensitive and sometimes does better.
The DRUSEN supplementary block, sets 23 to 26, was more mixed and generally more favorable to TRAST and CGFM than to Grad-CAM++. Set 23 was the strongest Grad-CAM++ case in the entire supplementary form, with a mean of 3.25, ahead of TRAST at 2.75 and CGFM at 2.25, although the difference was not significant, p = 0.156. Set 24 strongly favored TRAST at 2.75, with CGFM at 2.25 and Grad-CAM++ at 0.75, p = 0.024. Set 25 again had TRAST highest at 2.75, but CGFM was close at 2.50, and Grad-CAM++ lagged at 1.25, p = 0.097. Set 26 was a tie between TRAST and CGFM, both at 2.25, with Grad-CAM++ at 1.50, p = 0.584. This disease category therefore looks less one-sided than CNV or MH, but even here TRAST remains at least competitive in every image and clearly better than Grad-CAM++ in most of them.
The MH supplementary block, sets 27 to 30, was notable because CGFM essentially collapsed. Its means ranged only from 0.25 to 0.50 across those four images. TRAST and Grad-CAM++ were much stronger. Set 27 still favored TRAST slightly, 1.50 versus 1.25 for Grad-CAM++ and 0.50 for CGFM, p = 0.097. Sets 28 and 29 showed Grad-CAM++ at 2.25, tying or slightly exceeding TRAST at 2.25 and 2.00. Set 30 gave Grad-CAM++ its clearest MH advantage at 2.50 versus 2.25 for TRAST and 0.25 for CGFM, p = 0.058. So, for the extra MH images, the practical conclusion is that CGFM is not convincing, while TRAST and Grad-CAM++ are the methods that remain clinically plausible.
3.3.3. Combined Pattern Across All Image Ratings
We merged the main and supplementary image ratings, yielding 870 technique-specific observations (290 per technique). Overall, TRAST had a mean of 2.45, a median of 3, and 54.1% favorable ratings of 3 to 4. Grad-CAM++ had a mean of 1.63, a median of 1.5, and 22.1% favorable ratings. CGFM had a mean of 1.36, a median of 1, and 18.6% favorable ratings. At the low end, 57.2% of CGFM ratings and 50.0% of Grad-CAM++ ratings were 0 or 1, compared with only 15.2% for TRAST.
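The pooled favorable shares can be cross-checked against the per-form values reported in Sections 3.3.1 and 3.3.2. The sketch below reconstructs integer counts from the rounded percentages, which is an assumption (the raw counts are not given in the text), and then pools them over the 210 main-form and 80 supplementary ratings per technique. The function name is illustrative.

```python
N_MAIN = 10 * 21  # ratings per technique in the main form
N_SUPP = 20 * 4   # ratings per technique in the supplementary form

# Reported favorable shares (ratings of 3 or 4): (main form, supplementary form).
reported = {
    "TRAST": (0.610, 0.363),
    "Grad-CAM++": (0.243, 0.163),
    "CGFM": (0.205, 0.138),
}


def pooled_favorable_percent(share_main: float, share_supp: float) -> float:
    """Pool two reported shares via reconstructed integer counts."""
    # Assumption: rounding share * n recovers the underlying rating count.
    count_main = round(share_main * N_MAIN)
    count_supp = round(share_supp * N_SUPP)
    return 100.0 * (count_main + count_supp) / (N_MAIN + N_SUPP)


pooled = {
    name: round(pooled_favorable_percent(*shares), 1)
    for name, shares in reported.items()
}
print(pooled)
```

Under this reconstruction, the pooled shares come out at 54.1% for TRAST, 22.1% for Grad-CAM++, and 18.6% for CGFM, in line with the combined values stated above.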
Figure 5 provides additional detail at the individual expert level. The heatmap in Figure 5a shows that most experts gave their highest mean scores to TRAST, although some heterogeneity is present, and a smaller number of raters evaluated Grad-CAM++ more favorably in selected cases. Figure 5b shows that the mean score of TRAST remained stable between the main-only raters and the extended raters, whereas CGFM and Grad-CAM++ were lower in the extended group.