Systematic Review
Peer-Review Record

Diagnostic Accuracy of Artificial Intelligence in Predicting Anti-VEGF Treatment Response in Diabetic Macular Edema: A Systematic Review and Meta-Analysis

J. Clin. Med. 2025, 14(22), 8177; https://doi.org/10.3390/jcm14228177
by Faisal A. Al-Harbi 1,*, Mohanad A. Alkuwaiti 2, Meshari A. Alharbi 1, Ahmed A. Alessa 3, Ajwan A. Alhassan 4, Elan A. Aleidan 1, Fatimah Y. Al-Theyab 1, Mohammed Alfalah 4, Sajjad M. AlHaddad 5 and Ahmed Y. Azzam 6,7
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 8 September 2025 / Revised: 10 October 2025 / Accepted: 14 October 2025 / Published: 18 November 2025
(This article belongs to the Section Ophthalmology)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This systematic review and meta-analysis addresses a timely and clinically relevant topic—the application of artificial intelligence in predicting anti-VEGF treatment response in diabetic macular edema. The study is well-structured, follows PRISMA guidelines, and synthesizes a growing body of literature. The findings suggest that AI models show promising diagnostic accuracy, with potential implications for personalized medicine and resource optimization.

  1. The included studies used widely varying definitions of “treatment response” (e.g., CMT reduction thresholds, VA improvements, composite outcomes). Pooling such heterogeneous outcomes may compromise the validity of the meta-analysis. It is strongly recommended that the authors perform subgroup analyses stratified by standardized outcome definitions (e.g., anatomical vs. functional response, or by threshold values) to better interpret the results and assess consistency.
  2. Several studies reported outcomes at the eye level, with some patients contributing both eyes. This may introduce clustering effects and inflate precision. The authors should clarify whether they accounted for within-patient correlation (e.g., by using appropriate statistical methods or excluding duplicate eyes) and discuss the potential impact on the results.
  3. The use of NOS and RoB 2 tools is not optimal for diagnostic accuracy or prediction model studies. Tools such as QUADAS-2 (for diagnostic accuracy) or PROBAST/PROBAST-AI (for prediction models) are more appropriate. The authors should re-assess the included studies using these tools to better evaluate sources of bias such as spectrum bias, reference standard inappropriateness, or data leakage.
  4. Some studies reported exceptionally high performance metrics (e.g., AUC = 0.9998). These values should be critically discussed in terms of clinical and methodological plausibility. Potential reasons (e.g., overfitting, small sample size, data leakage) should be explored, and the impact of these outliers on pooled estimates should be assessed via sensitivity analysis.
  5. More related literature on AI-based diagnostics (https://doi.org/10.1002/VIW.20240001; https://doi.org/10.1002/VIW.20240059; Nature Sustainability 2024, 7, 602) should be included and discussed.

Minor Revisions Suggested

  1. Numerous placeholders (e.g., “Figure 1. xxx”) and incomplete tables/figures detract from the manuscript’s professionalism. All figures and tables should be completed and referenced appropriately in the text.
  2. Tables frequently use “NR” (not reported), limiting transparency. The authors should indicate whether attempts were made to contact original study authors for missing data. If not, this should be acknowledged as a limitation.

Author Response

Point-by-Point Response to Reviewers' Comments

Dear Editor and Reviewers,

We sincerely thank both reviewers for their thorough evaluation and constructive feedback. We have carefully addressed all comments and made comprehensive revisions to strengthen our manuscript. Below is a detailed point-by-point response demonstrating how each concern has been resolved. Additionally, we have performed English language editing throughout the manuscript to improve clarity, readability, and professional presentation.

Item

Reviewer Comment

Our Response & Actions Taken

Location in Revised Manuscript

REVIEWER 1 - MAJOR COMMENTS

1.1

The included studies used widely varying definitions of "treatment response" (e.g., CMT reduction thresholds, VA improvements, composite outcomes). Pooling such heterogeneous outcomes may compromise the validity of the meta-analysis. It is strongly recommended that the authors perform subgroup analyses stratified by standardized outcome definitions to better interpret the results and assess consistency.

Actions Taken:

  1. Subgroup Analysis by Outcome Definition: We performed comprehensive subgroup analysis stratified by outcome definition type (anatomical-only, composite anatomical+functional, and image generation). Results show significant differences (p = 0.012):
    • Anatomical-only: 83.9% sensitivity
    • Composite outcomes: 100.0% sensitivity
    • Other (image generation): 81.8% sensitivity
  2. Forest Plot Visualization: Created Figure 2 showing sensitivity estimates stratified by outcome definition with clear visual representation of subgroup differences.
  3. Meta-Regression Analysis: Outcome definition emerged as a significant predictor of diagnostic performance variability (p = 0.012), explaining substantial heterogeneity.
  4. Supplementary Figure: Panel B of Supplementary Figure 1 visualizes the meta-regression results for outcome definition type.

Interpretation: Our analysis confirms that outcome definition significantly impacts reported accuracy, with composite outcomes showing highest sensitivity. This justifies stratified presentation and strengthens validity.
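The stratified pooling described above can be sketched with a standard DerSimonian-Laird random-effects model on logit-transformed sensitivities; the per-study true-positive/false-negative counts below are hypothetical placeholders, not data from the included studies.

```python
import math

def pooled_sensitivity(tp_fn_pairs):
    """DerSimonian-Laird random-effects pool of logit-transformed
    per-study sensitivities (needs >= 2 studies); returns the
    back-transformed pooled estimate."""
    y, v = [], []
    for tp, fn in tp_fn_pairs:
        tp, fn = tp + 0.5, fn + 0.5          # continuity correction
        y.append(math.log(tp / fn))          # logit sensitivity
        v.append(1 / tp + 1 / fn)            # its approximate variance
    w = [1 / vi for vi in v]
    mu_fe = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_fe) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)  # between-study variance
    w_re = [1 / (vi + tau2) for vi in v]     # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return 1 / (1 + math.exp(-mu))           # back to probability scale

# Hypothetical (true positive, false negative) counts for one subgroup
print(round(pooled_sensitivity([(42, 8), (35, 7), (50, 11)]), 3))
```

Pooling each outcome-definition subgroup separately this way, and comparing the subgroup estimates, is the mechanism behind the stratified forest plot.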

Results:

• Section 3.4: Lines describing subgroup p-value (p=0.012)
• Section 3.6: Meta-regression results
• Table 3: Complete subgroup analysis by outcome definition
• Figure 2: Forest plot stratified by outcome definition
• Supplementary Figure 1, Panel B: Meta-regression visualization

1.2

Several studies reported outcomes at the eye level, with some patients contributing both eyes. This may introduce clustering effects and inflate precision. The authors should clarify whether they accounted for within-patient correlation and discuss the potential impact on the results.

Actions Taken:

  1. Comprehensive Clustering Assessment: Conducted systematic evaluation of clustering effects across all 18 studies:
    • 7 studies (38.9%) included both eyes without adjustment
    • 4 studies (22.2%) included only one eye per patient
    • 7 studies (38.9%) had unclear clustering status
  2. Design Effect Calculation: Calculated design effects for studies with sufficient data, ranging from 1.38 to 1.89, indicating moderate to substantial within-patient correlation.
  3. Meta-Analysis Subset Analysis: Among the 6 meta-analysis studies:
    • 3 (50.0%) had confirmed/probable clustering without adjustment
    • 2 (33.3%) had no clustering concerns
    • 1 (16.7%) had unclear clustering status
  4. Sensitivity Analysis: Excluding studies with high clustering risk showed minimal impact on pooled estimates, confirming result robustness.
  5. Supplementary Documentation: Created Supplementary Table 3 with detailed clustering assessment for each study.
  6. Limitations Discussion: Added comprehensive paragraph acknowledging clustering as a systematic quality concern and recommending appropriate statistical methods (GEE, mixed-effects models) for future studies.
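The design-effect calculation in point 2 follows the standard formula DE = 1 + (m − 1) × ICC, where m is the mean number of eyes per patient and ICC the within-patient correlation of the outcome. A minimal sketch, with illustrative values only (not extracted from the included studies):

```python
def design_effect(mean_eyes_per_patient: float, icc: float) -> float:
    """Design effect for clustered eyes: DE = 1 + (m - 1) * ICC,
    where m is the average cluster size (eyes per patient) and ICC
    the within-patient correlation of the outcome."""
    return 1.0 + (mean_eyes_per_patient - 1.0) * icc

def effective_n(n_eyes: int, de: float) -> float:
    """Effective sample size after deflating for clustering."""
    return n_eyes / de

# Illustrative: 200 eyes, ~1.8 eyes per patient, assumed ICC of 0.5
de = design_effect(1.8, 0.5)  # -> 1.4
print(de, round(effective_n(200, de), 1))
```

Dividing each study's eye count by its design effect shows how much apparent precision is lost when both eyes of a patient are treated as independent observations.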

Methods:

• Section 2.5: PROBAST-AI assessment includes clustering evaluation

Results:

• Section 3.9: Complete clustering assessment paragraph with quantitative analysis
• Supplementary Table 3: Study-by-study clustering details

Discussion:

• Limitations paragraph: Clustering bias discussion with design effects and recommendations for future studies

1.3

The use of NOS and RoB 2 tools is not optimal for diagnostic accuracy or prediction model studies. Tools such as QUADAS-2 (for diagnostic accuracy) or PROBAST/PROBAST-AI (for prediction models) are more appropriate. The authors should re-assess the included studies using these tools to better evaluate sources of bias.

Actions Taken:

  1. Adopted PROBAST-AI Framework: Completely re-assessed all 18 studies using the Prediction model Risk Of Bias ASsessment Tool for AI (PROBAST-AI), which is specifically designed for AI-based prediction models.
  2. Four-Domain Assessment: Evaluated each study across:
    • Participants and data sources
    • Predictors (predictor definition, measurement consistency)
    • Outcome (definition, measurement, blinding, timing)
    • Analysis (sample size, missing data, overfitting, validation, data leakage)
  3. AI-Specific Concerns: Special attention to:
    • Both-eyes inclusion without clustering adjustment
    • Small test sets relative to model parameters
    • Absence of external validation
    • Data-driven outcome optimization
    • Implausibly high performance metrics (overfitting indicators)
  4. Risk Classification:
    • Low risk: 2 studies (11.1%)
    • Unclear risk: 9 studies (50.0%)
    • High risk: 7 studies (38.9%)
  5. Detailed Documentation: Created Supplementary Table 1 with domain-specific risk ratings for each study.

Methods:

• Section 2.5 (highlighted): Comprehensive description of PROBAST-AI framework, all four domains, and AI-specific concerns

Results:

• Section 3.9: Risk of bias results with percentages and classifications
• Supplementary Table 1: Complete PROBAST-AI assessment for all 18 studies

1.4

Some studies reported exceptionally high performance metrics (e.g., AUC = 0.9998). These values should be critically discussed in terms of clinical and methodological plausibility. Potential reasons (e.g., overfitting, small sample size, data leakage) should be explored, and the impact of these outliers on pooled estimates should be assessed via sensitivity analysis.


Actions Taken:

  1. Critical Methodological Discussion: Added comprehensive discussion paragraph specifically addressing Song et al. 2025 (AUC = 0.9998) as methodologically implausible for real-world clinical prediction.
  2. Identified Potential Causes: Explicitly discussed:
    • Overfitting to training data
    • Potential data leakage between training/test sets
    • Highly selective test populations not representative of clinical heterogeneity
  3. PROBAST-AI Rating: Rated Song et al. as high risk of bias in the analysis domain due to implausible performance metrics.
  4. Generalizability Concerns: Stated these results are unlikely to generalize to independent patient populations and should be interpreted with extreme caution until external validation is demonstrated.
  5. Sensitivity Analysis: Performed leave-one-out sensitivity analysis (Table 6) showing stable pooled estimates when individual studies excluded, confirming minimal impact of outliers.
  6. Broader Implications: Used this as an example to emphasize the critical importance of external validation and realistic performance benchmarking in AI diagnostic studies.
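The leave-one-out procedure in point 5 re-pools the estimates after dropping each study in turn. A minimal sketch using a simple inverse-variance pool (the manuscript's bivariate model is more involved) and hypothetical (sensitivity, variance) pairs, including one implausibly optimistic outlier:

```python
def inv_var_pool(estimates):
    """Inverse-variance pooled mean of (value, variance) pairs."""
    w = [1 / v for _, v in estimates]
    return sum(wi * x for wi, (x, _) in zip(w, estimates)) / sum(w)

def leave_one_out(estimates):
    """Pooled estimate with each study removed in turn; a narrow
    spread indicates no single study drives the pooled result."""
    return [inv_var_pool(estimates[:i] + estimates[i + 1:])
            for i in range(len(estimates))]

# Hypothetical per-study sensitivities; the last is an outlier
studies = [(0.84, 0.004), (0.81, 0.006), (0.88, 0.005), (0.999, 0.003)]
loo = leave_one_out(studies)
print([round(x, 3) for x in loo])
print("max shift:", round(max(loo) - min(loo), 3))
```

Reporting the full vector of leave-one-out estimates, as in Table 6, makes visible whether an extreme study such as the AUC = 0.9998 report materially moves the pool.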

Results:

• Table 6: Leave-one-out sensitivity analysis demonstrating stability

Discussion:

• Limitations section (highlighted paragraph): Comprehensive critical evaluation of Song et al. 2025 AUC = 0.9998 with mechanistic explanations for implausibility and generalizability concerns

1.5

More related literature on AI-based diagnostics should be included and discussed:

  • https://doi.org/10.1002/VIW.20240001
  • https://doi.org/10.1002/VIW.20240059
  • Nature Sustainability 2024, 7, 602

Actions Taken:

  1. VIEW Paper 1 - Multimodal Glioma Survival (doi: 10.1002/VIW.20240001):
    • Added as Reference 34
    • Cited in Discussion as blueprint for multimodal DME modeling
    • Discussed late-fusion architecture, survival-aware objectives, and multi-center validation importance
  2. VIEW Paper 2 - Ischemic Stroke ML Model (doi: 10.1002/VIW.20240059):
    • Added as Reference 33
    • Cited in Discussion alongside Reference 34
    • Integrated into multimodal algorithm discussion
  3. Nature Sustainability 2024, 7, 602:
    • NOT INCLUDED - After retrieval and careful review, this paper focuses on "A sustainable approach to universal metabolic cancer diagnosis"
    • Rationale for exclusion: The paper is about cancer metabolic diagnostics and sustainability in oncology screening, which is not relevant to retinal imaging, DME pathophysiology, or ophthalmologic AI applications
    • No appropriate context exists in our DME/anti-VEGF/retinal imaging manuscript to cite this work without forcing an irrelevant reference

  4. Mura Detection Paper: Added the suggested reference on Mura detection (Reference 38).

Net result: Added three relevant papers (two VIEW papers and the Mura detection paper) with substantive integration into the Discussion. Excluded one paper after determining lack of relevance to the study topic.

Discussion:

• Multimodal oncology paragraph (highlighted): References 33 and 34 discussed as blueprints for multimodal DME modeling, with specific architectural details and validation recommendations

References:

• Reference 33: Lyu et al., VIEW 2024 (stroke ML model)
• Reference 34: Yuan et al., VIEW 2024 (glioma multimodal survival prediction)

REVIEWER 1 - MINOR COMMENTS

1.M1

Numerous placeholders (e.g., "Figure 1. xxx") and incomplete tables/figures detract from the manuscript's professionalism. All figures and tables should be completed and referenced appropriately in the text.

Actions Taken:

  1. All 7 Main Figures Completed:
    • Figure 1: PRISMA flow diagram (complete)
    • Figure 2: Forest plot for sensitivity by outcome definition (complete)
    • Figure 3: Summary ROC curve (complete)
    • Figure 4: Bivariate performance plot (complete)
    • Figure 5: Funnel plot for publication bias (complete)
    • Figure 6: Clinical utility and implementation readiness plot (complete)
    • Figure 7: Decision curve analysis (complete)
  2. Supplementary Figure 1: Meta-regression plots with 4 panels (complete)
  3. All Figure Captions: Professional, descriptive captions added to each figure
  4. All Tables Complete: Tables 1-6 fully populated with data, no placeholders remaining
  5. Proper Referencing: Every figure and table appropriately referenced in the text at first mention

Throughout Manuscript:

• All figures have complete captions and professional formatting
• All tables fully populated
• All figures/tables referenced in Results sections at appropriate locations
• No "xxx" or placeholder text remains

1.M2

Tables frequently use "NR" (not reported), limiting transparency. The authors should indicate whether attempts were made to contact original study authors for missing data. If not, this should be acknowledged as a limitation.

Actions Taken:

  1. Transparency Acknowledged: Added explicit limitation paragraph stating: "An additional limitation is that corresponding authors were not contacted to obtain missing or unreported data elements (marked as 'NR' throughout tables), which may have limited the completeness of our data extraction and prevented more comprehensive subgroup analyses."
  2. Future Recommendations: Stated: "Future systematic reviews should incorporate author contact protocols to maximize data availability and reduce reporting gaps."
  3. NR Definition: Clarified in table abbreviations that "NR = not reported" refers to data not available in the original publications

This enhances transparency and acknowledges this methodological limitation appropriately.

Discussion:

• Limitations section (highlighted paragraph): Explicit acknowledgment of no author contact for missing data, with recognition this limited data completeness and recommendation for future systematic reviews

REVIEWER 2 - ALL COMMENTS

2.1

The article systematically evaluates the diagnostic accuracy of AI in predicting anti-VEGF treatment response in DME patients, but can further expand the search scope, supplement more currently published prospective cohorts or RCTs, and use individual participant data (IPD) meta-analysis to improve the level of evidence and result stability.

Actions Taken:

  1. Search Scope Transparency: Our comprehensive search (PubMed, Web of Science, Embase, Scopus, Cochrane Library from inception to September 2025) identified limited prospective studies in this emerging field:
    • Only 1 RCT found (Mondal et al. 2025)
    • 89% retrospective designs reflect early-stage research area
  2. IPD Meta-Analysis Limitation Acknowledged: Added comprehensive discussion paragraph:
    • Acknowledged absence of IPD meta-analysis
    • Explained potential benefits: more precise treatment effect estimation, adjustment for participant-level covariates, better handling of clustering and missing data
    • Stated IPD was not accessible, necessitating aggregate data approach
    • Recognized this limits precision of subgroup analyses and heterogeneity assessments
  3. Future Studies Recommendations: Added comprehensive paragraph calling for:
    • Large-scale, prospective, multicenter studies with pre-specified outcome definitions
    • Prospective validation studies to eliminate data leakage risk
    • Prospective trials comparing AI-guided vs. standard treatment pathways
    • Patient-relevant outcomes (visual function, quality of life, treatment burden)
  4. Methodological Justification: Explained our approach represents best available evidence synthesis given current literature landscape

Discussion:

• IPD limitation paragraph (highlighted): Comprehensive discussion of IPD meta-analysis absence, potential benefits, and impact on precision

• Future studies paragraph (highlighted): Detailed recommendations for prospective, multicenter studies with specific methodological requirements and outcome specifications

2.2

The AI model architectures, input modalities, and hyperparameters used in the included studies differ substantially, with moderate heterogeneity (I² ≈ 45%). Further consideration can be given to establishing a multi-center, publicly annotated DME-OCT benchmark dataset and conducting blind testing of all candidate models on this dataset to eliminate performance bias caused by device and population differences.

Actions Taken:

  1. Heterogeneity Acknowledgment: Thoroughly documented I² = 45.2% for sensitivity and explained sources through meta-regression analysis
  2. Comprehensive Benchmark Dataset Discussion: Added detailed future directions paragraph proposing:
    • Multi-center, publicly annotated DME-OCT benchmark dataset with standardized outcome definitions
    • Analogous to existing benchmarks: Explicitly referenced EyePACS and Messidor datasets for diabetic retinopathy as successful models
    • Dataset Requirements: Diverse patient populations, multiple OCT device manufacturers, various anti-VEGF agents and dosing protocols, consensus-defined response criteria
    • Purpose: Enable direct performance comparison across AI algorithms while minimizing device-specific and population-specific biases
    • Impact: Facilitate transparent comparison, identify truly generalizable architectures, accelerate clinical translation
    • Testing Framework: Researchers could test models on identical holdout sets for fair comparison
  3. Meta-Regression for Heterogeneity: Conducted comprehensive meta-regression identifying significant predictors (outcome definition, follow-up duration) and explained 78.4% of sensitivity variance
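For reference, the reported I² follows Higgins' formula I² = max(0, (Q − df)/Q) × 100, where Q is Cochran's Q across k studies and df = k − 1. The Q value below is illustrative, chosen only to reproduce an I² of about 45%, not taken from the manuscript:

```python
def i_squared(q: float, df: int) -> float:
    """Higgins' I^2: the share of total variability attributable to
    between-study heterogeneity rather than chance, as a percentage.
    I^2 = max(0, (Q - df) / Q) * 100."""
    return max(0.0, (q - df) / q) * 100.0

# Illustrative: Q = 9.12 across 6 studies (df = 5) gives I^2 of ~45%
print(round(i_squared(9.12, 5), 1))
```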

Results:

• Section 3.6: Meta-regression analysis addressing heterogeneity sources
• Table 4: Complete meta-regression results

Discussion:

• Benchmark dataset paragraph (highlighted): Comprehensive proposal for multi-center, publicly annotated DME-OCT benchmark dataset with specific design features, analogies to existing successful benchmarks, and expected impact on field advancement

2.3

The article can introduce decision curve analysis (DCA) and budget impact modeling (BIM) in subsequent research to compare the quality-adjusted life years (QALY) and incremental cost-effectiveness ratio (ICER) of "AI-assisted decision-making" and "conventional treatment" within 5 years, providing economic evidence for medical insurance payment and guideline revision.

Actions Taken:

  1. Decision Curve Analysis Performed: Conducted DCA comparing AI model vs. treat-all and treat-none strategies:
    • Created Figure 7 showing net benefit across threshold probabilities
    • Key Finding: AI provides net benefit when threshold probability <62%
    • Parameters: Sensitivity 86.4%, specificity 77.6%, prevalence 40%
    • Demonstrated clinical utility for intermediate-probability cases
  2. DCA Results Section: Added comprehensive paragraph in Section 3.11 describing DCA methodology and findings
  3. DCA Discussion Integration: Added two discussion paragraphs:
    • First paragraph: Clinical interpretation of 62% threshold and utility for risk stratification
    • Second paragraph: Framework for selective AI deployment in uncertain treatment decisions
  4. Cost-Effectiveness Data: Added to Table 5 and Section 3.7:
    • Possible cost savings: 15-30% through reduced injection frequency
    • Time savings: 40-60% in image analysis and workflow
    • Resource optimization documented across studies
  5. Economic Analysis Discussion: Integrated cost-effectiveness findings into Discussion section
  6. Future QALY/ICER Studies: While we could not perform a prospective QALY/ICER analysis (this requires longitudinal outcome data unavailable in the included studies), we provide the DCA and cost data as a foundation for future economic evaluations
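The decision-curve computation in point 1 uses the standard net-benefit formula NB(pt) = TP/n − FP/n × pt/(1 − pt). A minimal sketch with the parameters quoted above (sensitivity 86.4%, specificity 77.6%, prevalence 40%); the exact crossover threshold depends on the full model outputs and may differ from the figure:

```python
def net_benefit(sens, spec, prev, pt):
    """Net benefit of a prediction model at threshold probability pt:
    NB = TP/n - FP/n * pt / (1 - pt), with rates per patient."""
    tp = prev * sens          # true-positive rate in the population
    fp = (1 - prev) * (1 - spec)  # false-positive rate
    return tp - fp * pt / (1 - pt)

def net_benefit_treat_all(prev, pt):
    """Treat-all strategy: every patient counted as test-positive."""
    return prev - (1 - prev) * pt / (1 - pt)

# Parameters reported in the response; treat-none has net benefit 0
for pt in (0.2, 0.4, 0.6):
    nb_model = net_benefit(0.864, 0.776, 0.40, pt)
    nb_all = net_benefit_treat_all(0.40, pt)
    print(f"pt={pt:.1f}  model={nb_model:.3f}  treat-all={nb_all:.3f}")
```

Plotting these three curves over a grid of thresholds yields the decision curve in Figure 7; the AI strategy is useful wherever its curve lies above both comparators.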

Results:

• Section 3.11 (highlighted): Complete DCA results with methodology and threshold findings
• Figure 7: Decision curve analysis plot
• Table 5: Cost-effectiveness data (15-30% savings, 40-60% time reduction)

Discussion:

• DCA interpretation paragraph (highlighted): Clinical utility explanation and threshold probability interpretation
• Cost-effectiveness discussion integrated

2.4

Multimodal algorithms have great inspiration and reference value for this task, such as:

  • Lightweight bilateral network of Mura detection on micro-OLED displays
  • Multi-task learning for hand heat trace time estimation and identity recognition
  • Deep soft threshold feature separation network for infrared handprint identity recognition and time estimation

If the existing research in this article does not involve multimodal algorithms, the above-mentioned papers should also be mentioned. The article can further explore hybrid 3D-CNN or Vision Transformer architectures fusing OCT and OCT-A, verify whether vascular features such as blood flow density and non-perfusion area volume can further improve the prediction AUC, and explain their biological rationale.

Actions Taken:

  1. All Three Papers Cited:
    • Reference 31: Yu et al. - Deep soft threshold feature separation network (Infrared Physics & Technology 2024)
    • Reference 32: Yu et al. - Multi-task learning for hand heat trace (Expert Systems with Applications 2024)
    • Reference 38: Lightweight bilateral network of Mura detection on micro-OLED displays
  2. Multimodal Algorithm Discussion Integration: Added comprehensive paragraph discussing:
    • Multi-task learning with explicit feature separation (inspired by Refs 31, 32, 38)
    • Application to DME: jointly classify responder status AND forecast temporal endpoints (time-to-dry macula, durability between injections)
    • Architectural features: soft-threshold/shrinkage blocks, balanced losses
    • Validation requirements: multi-center, prospective cohorts
  3. OCT-A Fusion Architecture Discussion: Added two comprehensive paragraphs:
    • First paragraph (Discussion, mid-section, highlighted):
      • Hybrid 3D-CNN and Vision Transformer architectures for OCT-A integration
      • Vascular Features: Macular perfusion density, vessel-length density, FAZ area/circularity, capillary non-perfusion volume
      • Evidence: Cited 3 papers (Refs 35-37) demonstrating association with treatment response
      • Biological Rationale: Explained pathophysiology - retinal ischemia and capillary dropout drive VEGF upregulation; severe baseline ischemia → attenuated anti-VEGF response; edema mechanism extends beyond VEGF-mediated permeability to structural microvascular loss
      • Multimodal Fusion Model: Structural features (fluid compartments, photoreceptor integrity) + vascular parameters (perfusion metrics) + clinical data (diabetes duration, HbA1c, baseline VA)
      • Validation Caveat: External validation needed to establish clinical benefit vs. algorithmic complexity
    • Second paragraph (Discussion, Future Directions, highlighted):
      • Detailed OCT-A integration proposal
      • Vascular features as prognostic information
      • Impaired perfusion association with suboptimal outcomes
      • Prospective comparison studies needed
  4. AUC Improvement Discussion: Addressed whether vascular features can improve AUC, with appropriate caution that validation is needed to confirm clinical utility

Discussion:

• Infrared-thermography papers paragraph (highlighted): Multi-task learning discussion citing Refs 31, 32, 38 with architectural details and DME application

• OCT-A integration paragraph 1 (highlighted, mid-Discussion): Comprehensive discussion of hybrid 3D-CNN/Vision Transformer for OCT-A fusion, specific vascular parameters, biological rationale for ischemia-response relationship, multimodal fusion model components, validation requirements (cites Refs 35-37)

• OCT-A integration paragraph 2 (highlighted, Future Directions): Detailed proposal for prospective OCT-A studies

References:

• Ref 31, 32, 38: Multimodal algorithm papers
• Ref 35-37: OCT-A vascular features papers

ADDITIONAL IMPROVEMENTS NOT REQUESTED BUT IMPLEMENTED

A.1

Reviewer 2 English language editing concern

We performed thorough English language editing throughout the entire manuscript to improve:

  • Clarity: Restructured complex sentences for better readability
  • Grammar: Corrected grammatical structures and verb tenses
  • Professional Tone: Enhanced scientific writing style consistency
  • Precision: Replaced ambiguous terms with specific terminology
  • Flow: Improved transitions between sections and paragraphs
  • Conciseness: Eliminated redundant phrases while maintaining completeness

This enhances overall manuscript quality and readability for international audiences.

Throughout entire manuscript:
• Abstract
• Introduction
• Methods
• Results (all sections)
• Discussion
• Conclusion
• Figure captions

Summary Statement:
We have systematically addressed all major and minor comments from both reviewers with comprehensive revisions. All requested analyses have been performed, appropriate statistical tools implemented (PROBAST-AI, DCA), critical methodological discussions added (clustering, high AUC values, IPD limitations), and all requested literature integrated with substantive discussion. Additionally, we performed extensive English language editing and data accuracy verification. The revised manuscript now provides robust, transparent, and comprehensive evidence for AI-based prediction of anti-VEGF treatment response in DME patients, with clear acknowledgment of limitations and detailed future research directions. We believe these revisions have substantially strengthened the manuscript and addressed all reviewer concerns comprehensively.
We sincerely thank both reviewers for their constructive feedback, which has significantly improved the quality and rigor of our manuscript.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

  1. The article systematically evaluates the diagnostic accuracy of AI in predicting anti-VEGF treatment response in DME patients, but can further expand the search scope, supplement more currently published prospective cohorts or RCTs, and use individual participant data (IPD) meta-analysis to improve the level of evidence and result stability.
  2. The AI model architectures, input modalities, and hyperparameters used in the included studies differ substantially, with moderate heterogeneity (I² ≈ 45%). Further consideration can be given to establishing a multi-center, publicly annotated DME-OCT benchmark dataset and conducting blind testing of all candidate models on this dataset to eliminate performance bias caused by device and population differences.
  3. The article can introduce decision curve analysis (DCA) and budget impact modeling (BIM) in subsequent research to compare the quality-adjusted life years (QALY) and incremental cost-effectiveness ratio (ICER) of "AI-assisted decision-making" and "conventional treatment" within 5 years, providing economic evidence for medical insurance payment and guideline revision.
  4. Multimodal algorithms have great inspiration and reference value for this task, such as: Lightweight bilateral network of Mura detection on micro-OLED displays; Multi-task learning for hand heat trace time estimation and identity recognition; Deep soft threshold feature separation network for infrared handprint identity recognition and time estimation. If the existing research in this article does not involve multimodal algorithms, the above-mentioned papers should also be mentioned. The article can further explore hybrid 3D-CNN or Vision Transformer architectures fusing OCT and OCT-A, verify whether vascular features such as blood flow density and non-perfusion area volume can further improve the prediction AUC, and explain their biological rationale.

Author Response

Point-by-Point Response to Reviewers' Comments

Dear Editor and Reviewers,
We sincerely thank both reviewers for their thorough evaluation and constructive feedback. We have carefully addressed all comments and made comprehensive revisions to strengthen our manuscript. Below is a detailed point-by-point response demonstrating how each concern has been resolved. Additionally, we have performed English language editing throughout the manuscript to improve clarity, readability, and professional presentation.

Item

Reviewer Comment

Our Response & Actions Taken

Location in Revised Manuscript

REVIEWER 1 - MAJOR COMMENTS

1.1

The included studies used widely varying definitions of "treatment response" (e.g., CMT reduction thresholds, VA improvements, composite outcomes). Pooling such heterogeneous outcomes may compromise the validity of the meta-analysis. It is strongly recommended that the authors perform subgroup analyses stratified by standardized outcome definitions to better interpret the results and assess consistency.

FActions Taken:

  1. Subgroup Analysis by Outcome Definition: We performed comprehensive subgroup analysis stratified by outcome definition type (anatomical-only, composite anatomical+functional, and image generation). Results show significant differences (p = 0.012):
    • Anatomical-only: 83.9% sensitivity
    • Composite outcomes: 100.0% sensitivity
    • Other (image generation): 81.8% sensitivity
  2. Forest Plot Visualization: Created Figure 2 showing sensitivity estimates stratified by outcome definition with clear visual representation of subgroup differences.
  3. Meta-Regression Analysis: Outcome definition emerged as a significant predictor of diagnostic performance variability (p = 0.012), explaining substantial heterogeneity.
  4. Supplementary Figure: Panel B of Supplementary Figure 1 visualizes the meta-regression results for outcome definition type.

Interpretation: Our analysis confirms that outcome definition significantly impacts reported accuracy, with composite outcomes showing highest sensitivity. This justifies stratified presentation and strengthens validity.

Results:

• Section 3.4: Subgroup analysis with between-group difference (p = 0.012)
• Section 3.6: Meta-regression results
• Table 3: Complete subgroup analysis by outcome definition
• Figure 2: Forest plot stratified by outcome definition
• Supplementary Figure 1, Panel B: Meta-regression visualization

1.2

Several studies reported outcomes at the eye level, with some patients contributing both eyes. This may introduce clustering effects and inflate precision. The authors should clarify whether they accounted for within-patient correlation and discuss the potential impact on the results.

Actions Taken:

  1. Comprehensive Clustering Assessment: Conducted systematic evaluation of clustering effects across all 18 studies:
    • 7 studies (38.9%) included both eyes without adjustment
    • 4 studies (22.2%) included only one eye per patient
    • 7 studies (38.9%) had unclear clustering status
  2. Design Effect Calculation: Calculated design effects for studies with sufficient data, ranging from 1.38 to 1.89, indicating moderate to significant within-patient correlation.
  3. Meta-Analysis Subset Analysis: Among the 6 meta-analysis studies:
    • 3 (50.0%) had confirmed/probable clustering without adjustment
    • 2 (33.3%) had no clustering concerns
    • 1 (16.7%) had unclear clustering status
  4. Sensitivity Analysis: Excluding studies with high clustering risk showed minimal impact on pooled estimates, confirming result robustness.
  5. Supplementary Documentation: Created Supplementary Table 3 with detailed clustering assessment for each study.
  6. Limitations Discussion: Added comprehensive paragraph acknowledging clustering as a systematic quality concern and recommending appropriate statistical methods (GEE, mixed-effects models) for future studies.
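The design-effect calculation referenced in point 2 follows the standard Kish formula. A minimal sketch, using a hypothetical study size and an assumed inter-eye correlation (the actual per-study inputs are in Supplementary Table 3):

```python
def design_effect(n_eyes, n_patients, icc):
    """Kish design effect for clustered (two-eye) data:
    DEFF = 1 + (m_bar - 1) * ICC, where m_bar is the mean number of
    eyes per patient. Effective sample size = n_eyes / DEFF."""
    m_bar = n_eyes / n_patients
    return 1.0 + (m_bar - 1.0) * icc

# Hypothetical study: 150 eyes from 90 patients, assumed inter-eye ICC of 0.6
deff = design_effect(150, 90, 0.6)
effective_n = 150 / deff
print(f"DEFF = {deff:.2f}, effective sample size = {effective_n:.0f}")
```

A design effect of 1.4 here means the 150 eyes carry only about as much statistical information as 107 independent eyes, which is why unadjusted both-eye analyses overstate precision.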

Methods:

• Section 2.5: PROBAST-AI assessment includes clustering evaluation

Results:

• Section 3.9: Complete clustering assessment paragraph with quantitative analysis
• Supplementary Table 3: Study-by-study clustering details

Discussion:

• Limitations paragraph: Clustering bias discussion with design effects and recommendations for future studies

1.3

The use of NOS and RoB 2 tools is not optimal for diagnostic accuracy or prediction model studies. Tools such as QUADAS-2 (for diagnostic accuracy) or PROBAST/PROBAST-AI (for prediction models) are more appropriate. The authors should re-assess the included studies using these tools to better evaluate sources of bias.

Actions Taken:

  1. Adopted PROBAST-AI Framework: Completely re-assessed all 18 studies using the Prediction model Risk Of Bias ASsessment Tool for AI (PROBAST-AI), which is specifically designed for AI-based prediction models.
  2. Four-Domain Assessment: Evaluated each study across:
    • Participants and data sources
    • Predictors (predictor definition, measurement consistency)
    • Outcome (definition, measurement, blinding, timing)
    • Analysis (sample size, missing data, overfitting, validation, data leakage)
  3. AI-Specific Concerns: Special attention to:
    • Both-eyes inclusion without clustering adjustment
    • Small test sets relative to model parameters
    • Absence of external validation
    • Data-driven outcome optimization
    • Implausibly high performance metrics (overfitting indicators)
  4. Risk Classification:
    • Low risk: 2 studies (11.1%)
    • Unclear risk: 9 studies (50.0%)
    • High risk: 7 studies (38.9%)
  5. Detailed Documentation: Created Supplementary Table 1 with domain-specific risk ratings for each study.

Methods:

• Section 2.5 (highlighted): Comprehensive description of PROBAST-AI framework, all four domains, and AI-specific concerns

Results:

• Section 3.9: Risk of bias results with percentages and classifications
• Supplementary Table 1: Complete PROBAST-AI assessment for all 18 studies

1.4

Some studies reported exceptionally high performance metrics (e.g., AUC = 0.9998). These values should be critically discussed in terms of clinical and methodological plausibility. Potential reasons (e.g., overfitting, small sample size, data leakage) should be explored, and the impact of these outliers on pooled estimates should be assessed via sensitivity analysis.


Actions Taken:

  1. Critical Methodological Discussion: Added comprehensive discussion paragraph specifically addressing Song et al. 2025 (AUC = 0.9998) as methodologically implausible for real-world clinical prediction.
  2. Identified Potential Causes: Explicitly discussed:
    • Overfitting to training data
    • Potential data leakage between training/test sets
    • Highly selective test populations not representative of clinical heterogeneity
  3. PROBAST-AI Rating: Rated Song et al. as high risk of bias in the analysis domain due to implausible performance metrics.
  4. Generalizability Concerns: Stated that these results are unlikely to generalize to independent patient populations and should be interpreted with extreme caution until external validation is demonstrated.
  5. Sensitivity Analysis: Performed leave-one-out sensitivity analysis (Table 6) showing stable pooled estimates when individual studies excluded, confirming minimal impact of outliers.
  6. Broader Implications: Used this as an example to emphasize the critical importance of external validation and realistic performance benchmarking in AI diagnostic studies.
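The leave-one-out procedure in point 5 recomputes the pooled estimate with each study excluded in turn. A minimal unweighted sketch with hypothetical AUCs (the review's actual analysis uses the weighted bivariate model and is reported in Table 6); "F" mimics an implausibly high outlier:

```python
def leave_one_out(estimates):
    """Recompute a simple (unweighted) pooled mean with each study
    removed in turn, to gauge the influence of any single study."""
    results = {}
    for name in estimates:
        rest = [v for k, v in estimates.items() if k != name]
        results[name] = sum(rest) / len(rest)
    return results

# Hypothetical per-study AUCs; "F" plays the role of an extreme outlier.
aucs = {"A": 0.84, "B": 0.88, "C": 0.81, "D": 0.86, "E": 0.90, "F": 0.9998}
for excluded, pooled in leave_one_out(aucs).items():
    print(f"without {excluded}: pooled AUC = {pooled:.3f}")
```

If the pooled estimate excluding the outlier remains close to the full-sample estimate, the outlier's influence on the synthesis is limited, which is the stability argument made above.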

Results:

• Table 6: Leave-one-out sensitivity analysis demonstrating stability

Discussion:

• Limitations section (highlighted paragraph): Comprehensive critical evaluation of Song et al. 2025 AUC = 0.9998 with mechanistic explanations for implausibility and generalizability concerns

1.5

More related literatures on AI based diagnostics should be included and discussed:

  • https://doi.org/10.1002/VIW.20240001
  • https://doi.org/10.1002/VIW.20240059
  • Nature Sustainability 2024, 7, 602

Actions Taken:

  1. VIEW Paper 1 - Multimodal Glioma Survival (doi: 10.1002/VIW.20240001):
    • Added as Reference 34
    • Cited in Discussion as blueprint for multimodal DME modeling
    • Discussed late-fusion architecture, survival-aware objectives, and multi-center validation importance
  2. VIEW Paper 2 - Ischemic Stroke ML Model (doi: 10.1002/VIW.20240059):
    • Added as Reference 33
    • Cited in Discussion alongside Reference 34
    • Integrated into multimodal algorithm discussion
  3. Nature Sustainability 2024, 7, 602:
    • NOT INCLUDED - After retrieval and careful review, this paper focuses on "A sustainable approach to universal metabolic cancer diagnosis"
    • Rationale for exclusion: The paper is about cancer metabolic diagnostics and sustainability in oncology screening, which is not relevant to retinal imaging, DME pathophysiology, or ophthalmologic AI applications
    • No appropriate context exists in our DME/anti-VEGF/retinal imaging manuscript to cite this work without forcing an irrelevant reference

  4. Added the suggested reference on Mura detection (Reference 38).

Net result: Added 3 relevant papers (References 33, 34, and 38) with substantive integration into the Discussion. Excluded 1 paper after determining lack of relevance to the study topic.

Discussion:

• Multimodal oncology paragraph (highlighted): References 33 and 34 discussed as blueprints for multimodal DME modeling, with specific architectural details and validation recommendations

References:

• Reference 33: Lyu et al., VIEW 2024 (stroke ML model)
• Reference 34: Yuan et al., VIEW 2024 (glioma multimodal survival prediction)

REVIEWER 1 - MINOR COMMENTS

1.M1

Numerous placeholders (e.g., "Figure 1. xxx") and incomplete tables/figures detract from the manuscript's professionalism. All figures and tables should be completed and referenced appropriately in the text.

Actions Taken:

  1. All 7 Main Figures Completed:
    • Figure 1: PRISMA flow diagram (complete)
    • Figure 2: Forest plot for sensitivity by outcome definition (complete)
    • Figure 3: Summary ROC curve (complete)
    • Figure 4: Bivariate performance plot (complete)
    • Figure 5: Funnel plot for publication bias (complete)
    • Figure 6: Clinical utility and implementation readiness plot (complete)
    • Figure 7: Decision curve analysis (complete)
  2. Supplementary Figure 1: Meta-regression plots with 4 panels (complete)
  3. All Figure Captions: Professional, descriptive captions added to each figure
  4. All Tables Complete: Tables 1-6 fully populated with data, no placeholders remaining
  5. Proper Referencing: Every figure and table appropriately referenced in the text at first mention

Throughout Manuscript:

• All figures have complete captions and professional formatting
• All tables fully populated
• All figures/tables referenced in Results sections at appropriate locations
• No "xxx" or placeholder text remains

1.M2

Tables frequently use "NR" (not reported), limiting transparency. The authors should indicate whether attempts were made to contact original study authors for missing data. If not, this should be acknowledged as a limitation.

Actions Taken:

  1. Transparency Acknowledged: Added explicit limitation paragraph stating: "An additional limitation is that corresponding authors were not contacted to obtain missing or unreported data elements (marked as 'NR' throughout tables), which may have limited the completeness of our data extraction and prevented more comprehensive subgroup analyses."
  2. Future Recommendations: Stated: "Future systematic reviews should incorporate author contact protocols to maximize data availability and reduce reporting gaps."
  3. NR Definition: Clarified in table abbreviations that "NR = not reported" refers to data not available in the original publications

This enhances transparency and acknowledges this methodological limitation appropriately.

Discussion:

• Limitations section (highlighted paragraph): Explicit acknowledgment of no author contact for missing data, with recognition this limited data completeness and recommendation for future systematic reviews

REVIEWER 2 - ALL COMMENTS

2.1

The article systematically evaluates the diagnostic accuracy of AI in predicting anti-VEGF treatment response in DME patients, but can further expand the search scope, supplement more currently published prospective cohorts or RCTs, and use individual participant data (IPD) meta-analysis to improve the level of evidence and result stability.

Actions Taken:

  1. Search Scope Transparency: Our comprehensive search (PubMed, Web of Science, Embase, Scopus, Cochrane Library from inception to September 2025) identified limited prospective studies in this emerging field:
    • Only 1 RCT found (Mondal et al. 2025)
    • 89% retrospective designs reflect early-stage research area
  2. IPD Meta-Analysis Limitation Acknowledged: Added comprehensive discussion paragraph:
    • Acknowledged absence of IPD meta-analysis
    • Explained potential benefits: more precise treatment effect estimation, adjustment for participant-level covariates, better handling of clustering and missing data
    • Stated IPD was not accessible, necessitating aggregate data approach
    • Recognized this limits precision of subgroup analyses and heterogeneity assessments
  3. Future Studies Recommendations: Added comprehensive paragraph calling for:
    • Large-scale, prospective, multicenter studies with pre-specified outcome definitions
    • Prospective validation studies to eliminate data leakage risk
    • Prospective trials comparing AI-guided vs. standard treatment pathways
    • Patient-relevant outcomes (visual function, quality of life, treatment burden)
  4. Methodological Justification: Explained our approach represents best available evidence synthesis given current literature landscape

Discussion:

• IPD limitation paragraph (highlighted): Comprehensive discussion of IPD meta-analysis absence, potential benefits, and impact on precision

• Future studies paragraph (highlighted): Detailed recommendations for prospective, multicenter studies with specific methodological requirements and outcome specifications

2.2

The AI model architecture, input modality, and hyperparameters used in the article have significant differences and moderate heterogeneity (I² ≈ 45%). Further consideration can be given to establishing a multi-center, publicly annotated DME-OCT benchmark dataset and conducting blind testing on all candidate models on this dataset to eliminate performance bias caused by device and population differences.

Actions Taken:

  1. Heterogeneity Acknowledgment: Thoroughly documented I² = 45.2% for sensitivity and explained sources through meta-regression analysis
  2. Comprehensive Benchmark Dataset Discussion: Added detailed future directions paragraph proposing:
    • Multi-center, publicly annotated DME-OCT benchmark dataset with standardized outcome definitions
    • Analogous to existing benchmarks: Explicitly referenced EyePACS and Messidor datasets for diabetic retinopathy as successful models
    • Dataset Requirements: Diverse patient populations, multiple OCT device manufacturers, various anti-VEGF agents and dosing protocols, consensus-defined response criteria
    • Purpose: Enable direct performance comparison across AI algorithms while minimizing device-specific and population-specific biases
    • Impact: Facilitate transparent comparison, identify truly generalizable architectures, accelerate clinical translation
    • Testing Framework: Researchers could test models on identical holdout sets for fair comparison
  3. Meta-Regression for Heterogeneity: Conducted comprehensive meta-regression identifying significant predictors (outcome definition, follow-up duration) and explained 78.4% of sensitivity variance

Results:

• Section 3.6: Meta-regression analysis addressing heterogeneity sources
• Table 4: Complete meta-regression results

Discussion:

• Benchmark dataset paragraph (highlighted): Comprehensive proposal for multi-center, publicly annotated DME-OCT benchmark dataset with specific design features, analogies to existing successful benchmarks, and expected impact on field advancement

2.3

The article can introduce decision curve analysis (DCA) and budget impact modeling (BIM) in subsequent research to compare the quality-adjusted life years (QALY) and incremental cost-effectiveness ratio (ICER) of "AI-assisted decision-making" and "conventional treatment" within 5 years, providing economic evidence for medical insurance payment and guideline revision.

Actions Taken:

  1. Decision Curve Analysis Performed: Conducted DCA comparing AI model vs. treat-all and treat-none strategies:
    • Created Figure 7 showing net benefit across threshold probabilities
    • Key Finding: AI provides net benefit when threshold probability <62%
    • Parameters: Sensitivity 86.4%, specificity 77.6%, prevalence 40%
    • Demonstrated clinical utility for intermediate-probability cases
  2. DCA Results Section: Added comprehensive paragraph in Section 3.11 describing DCA methodology and findings
  3. DCA Discussion Integration: Added two discussion paragraphs:
    • First paragraph: Clinical interpretation of 62% threshold and utility for risk stratification
    • Second paragraph: Framework for selective AI deployment in uncertain treatment decisions
  4. Cost-Effectiveness Data: Added to Table 5 and Section 3.7:
    • Possible cost savings: 15-30% through reduced injection frequency
    • Time savings: 40-60% in image analysis and workflow
    • Resource optimization documented across studies
  5. Economic Analysis Discussion: Integrated cost-effectiveness findings into Discussion section
  6. Future QALY/ICER Studies: While we could not perform a prospective QALY/ICER analysis (this requires longitudinal outcome data unavailable in the included studies), we provided DCA and cost data as a foundation for future economic evaluations
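The net-benefit quantity underlying the decision curve analysis in point 1 can be expressed directly from sensitivity, specificity, and prevalence. A minimal sketch using the parameters stated above (sensitivity 86.4%, specificity 77.6%, prevalence 40%); the paper's Figure 7 is computed from study-level data, so exact crossover points there may differ from this simplified calculation:

```python
def net_benefit(sens, spec, prev, pt):
    """Standard decision-curve net benefit at threshold probability pt:
    NB = TP/n - FP/n * pt/(1-pt), written via sensitivity/specificity."""
    tp_rate = sens * prev
    fp_rate = (1.0 - spec) * (1.0 - prev)
    return tp_rate - fp_rate * pt / (1.0 - pt)

def net_benefit_treat_all(prev, pt):
    # Treat-all: every case treated, so TP/n = prev and FP/n = 1 - prev.
    return prev - (1.0 - prev) * pt / (1.0 - pt)

# Parameters reported in the response above.
for pt in (0.2, 0.4, 0.6):
    nb_model = net_benefit(0.864, 0.776, 0.40, pt)
    nb_all = net_benefit_treat_all(0.40, pt)
    print(f"pt={pt:.1f}: model NB={nb_model:.3f}, treat-all NB={nb_all:.3f}")
```

The model adds value over the treat-all and treat-none defaults across the intermediate threshold range, which is the basis for the "selective AI deployment in uncertain treatment decisions" framing in the Discussion.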

Results:

• Section 3.11 (highlighted): Complete DCA results with methodology and threshold findings
• Figure 7: Decision curve analysis plot
• Table 5: Cost-effectiveness data (15-30% savings, 40-60% time reduction)

Discussion:

• DCA interpretation paragraph (highlighted): Clinical utility explanation and threshold probability interpretation
• Cost-effectiveness discussion integrated

2.4

Multimodal algorithms have great inspiration and reference value for this task, such as:

  • Lightweight bilateral network of Mura detection on micro-OLED displays
  • Multi-task learning for hand heat trace time estimation and identity recognition
  • Deep soft threshold feature separation network for infrared handprint identity recognition and time estimation

If the existing research in this article does not involve multimodal algorithms, the above-mentioned papers should also be mentioned. The article can further explore the hybrid 3D-CNN or Vision Transformer architecture of OCT and OCT-A fusion structures, verify whether vascular features such as blood flow density and non-perfusion area volume can further improve the prediction of AUC, and explain their biological rationality.

Actions Taken:

  1. All Three Papers Cited:
    • Reference 31: Yu et al. - Deep soft threshold feature separation network (Infrared Physics & Technology 2024)
    • Reference 32: Yu et al. - Multi-task learning for hand heat trace (Expert Systems with Applications 2024)
    • Reference 38: Lightweight bilateral network of Mura detection on micro-OLED displays
  2. Multimodal Algorithm Discussion Integration: Added comprehensive paragraph discussing:
    • Multi-task learning with explicit feature separation (inspired by Refs 31, 32, 38)
    • Application to DME: jointly classify responder status AND forecast temporal endpoints (time-to-dry macula, durability between injections)
    • Architectural features: soft-threshold/shrinkage blocks, balanced losses
    • Validation requirements: multi-center, prospective cohorts
  3. OCT-A Fusion Architecture Discussion: Added two comprehensive paragraphs:
    • First paragraph (Discussion, mid-section, highlighted):
      • Hybrid 3D-CNN and Vision Transformer architectures for OCT-A integration
      • Vascular Features: Macular perfusion density, vessel-length density, FAZ area/circularity, capillary non-perfusion volume
      • Evidence: Cited 3 papers (Refs 35-37) demonstrating association with treatment response
      • Biological Rationale: Explained pathophysiology - retinal ischemia and capillary dropout drive VEGF upregulation; severe baseline ischemia → attenuated anti-VEGF response; edema mechanism extends beyond VEGF-mediated permeability to structural microvascular loss
      • Multimodal Fusion Model: Structural features (fluid compartments, photoreceptor integrity) + vascular parameters (perfusion metrics) + clinical data (diabetes duration, HbA1c, baseline VA)
      • Validation Caveat: External validation needed to establish clinical benefit vs. algorithmic complexity
    • Second paragraph (Discussion, Future Directions, highlighted):
      • Detailed OCT-A integration proposal
      • Vascular features as prognostic information
      • Impaired perfusion association with suboptimal outcomes
      • Prospective comparison studies needed
  4. AUC Improvement Discussion: Addressed whether vascular features can improve AUC, with appropriate caution that validation is needed to confirm clinical utility

Discussion:

• Infrared-thermography papers paragraph (highlighted): Multi-task learning discussion citing Refs 31, 32, 38 with architectural details and DME application

• OCT-A integration paragraph 1 (highlighted, mid-Discussion): Comprehensive discussion of hybrid 3D-CNN/Vision Transformer for OCT-A fusion, specific vascular parameters, biological rationale for ischemia-response relationship, multimodal fusion model components, validation requirements (cites Refs 35-37)

• OCT-A integration paragraph 2 (highlighted, Future Directions): Detailed proposal for prospective OCT-A studies

References:

• Ref 31, 32, 38: Multimodal algorithm papers
• Ref 35-37: OCT-A vascular features papers

ADDITIONAL IMPROVEMENTS NOT REQUESTED BUT IMPLEMENTED

A.1

Reviewer 2 English language editing concern

We performed thorough English language editing throughout the entire manuscript to improve:

  • Clarity: Restructured complex sentences for better readability
  • Grammar: Corrected grammatical structures and verb tenses
  • Professional Tone: Enhanced scientific writing style consistency
  • Precision: Replaced ambiguous terms with specific terminology
  • Flow: Improved transitions between sections and paragraphs
  • Conciseness: Eliminated redundant phrases while maintaining completeness

This enhances overall manuscript quality and readability for international audiences.

Throughout entire manuscript:
• Abstract
• Introduction
• Methods
• Results (all sections)
• Discussion
• Conclusion
• Figure captions

Summary Statement:
We have systematically addressed all major and minor comments from both reviewers with comprehensive revisions. All requested analyses have been performed, appropriate statistical tools implemented (PROBAST-AI, DCA), critical methodological discussions added (clustering, high AUC values, IPD limitations), and all requested literature integrated with substantive discussion. Additionally, we performed extensive English language editing and data accuracy verification. The revised manuscript now provides robust, transparent, and comprehensive evidence for AI-based prediction of anti-VEGF treatment response in DME patients, with clear acknowledgment of limitations and detailed future research directions. We believe these revisions have substantially strengthened the manuscript and addressed all reviewer concerns comprehensively.
We sincerely thank both reviewers for their constructive feedback, which has significantly improved the quality and rigor of our manuscript.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Reviewer 2 Report

Comments and Suggestions for Authors

The revised version has a very good improvement in algorithm and logic. I warmly recommend publication in present form.
