Systematic Review

Artificial Intelligence for Predicting Treatment Response in Neovascular Age-Related Macular Degeneration with Anti-VEGF: A Systematic Review and Meta-Analysis

1 Department of Ophthalmology, Kaohsiung Veterans General Hospital, Kaohsiung 813414, Taiwan
2 Department of Medical Education, Taipei Veterans General Hospital, Taipei 112304, Taiwan
3 School of Medicine, National Yang-Ming Chiao Tung University, Taipei 112304, Taiwan
4 Institute of Biophotonics, National Yang-Ming Chiao Tung University, Taipei 112304, Taiwan
5 Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2026, 8(1), 23; https://doi.org/10.3390/make8010023
Submission received: 26 December 2025 / Revised: 15 January 2026 / Accepted: 16 January 2026 / Published: 19 January 2026

Abstract

Age-related macular degeneration (AMD) is a leading cause of irreversible vision loss; anti-vascular endothelial growth factor (anti-VEGF) therapy is standard care for neovascular AMD (nAMD), yet treatment response varies. We systematically reviewed and meta-analyzed artificial intelligence (AI) and machine learning (ML) models using optical coherence tomography (OCT)-derived information to predict anti-VEGF treatment response in nAMD. PubMed, Embase, Web of Science, and IEEE Xplore were searched from inception to 18 December 2025 for eligible studies reporting threshold-based performance. Two reviewers screened studies, extracted data, and assessed risk of bias using PROBAST+AI; pooled sensitivity and specificity were estimated with a bivariate random-effects model. Seven studies met inclusion criteria, and six were synthesized quantitatively. Pooled sensitivity was 0.79 (95% CI 0.68–0.87), and pooled specificity was 0.83 (95% CI 0.62–0.94), with substantial heterogeneity. Specificity tended to be higher for long-term and functional outcomes than for short-term and anatomical outcomes. Most studies had a high risk of bias, mainly due to limited external validation and incomplete reporting. OCT-based AI models may help stratify treatment response in nAMD, but prospective, multicenter validation and standardized outcome definitions are needed before routine use; current evidence shows no consistent advantage of deep learning over engineered radiomic features.

1. Introduction

Age-related macular degeneration (AMD) is a leading cause of irreversible vision loss worldwide, and the neovascular (“wet”) subtype (nAMD) accounts for most cases of severe, rapid visual deterioration. The global prevalence and projected burden of AMD continue to rise with population aging, making nAMD a persistent public health and clinical workload challenge [1,2]. Anti-vascular endothelial growth factor (anti-VEGF) therapy has transformed nAMD outcomes and remains the cornerstone of modern management, with landmark trials demonstrating clinically meaningful visual benefit in many patients [3,4,5]. Yet, treatment response is heterogeneous: some eyes achieve robust anatomical improvement and visual stabilization, while others show incomplete or variable response despite receiving therapy. This variability matters because early identification of likely responders vs. limited responders could support more personalized care pathways, reduce avoidable clinic burden, and help clinicians counsel patients with greater precision [3,4,5].
Optical coherence tomography (OCT) is central to nAMD assessment because it provides high-resolution cross-sectional visualization of the macula and enables quantification of disease activity through imaging phenotypes (for example, intraretinal fluid (IRF), subretinal fluid (SRF), and pigment epithelial detachment (PED) patterns). Standardized OCT terminology is particularly important for aggregating evidence across studies and devices, and consensus nomenclature has helped harmonize how neovascular AMD features are reported [6]. In parallel, a substantial literature describes OCT-based biomarkers and their prognostic implications, including reviews and systematic syntheses evaluating how fluid compartments and structural changes relate to functional outcomes [7,8,9].
The technical opportunity is that OCT is not only “readable” by humans; it is also machine readable at scale. Feature engineering and knowledge extraction from imaging—turning patterns into measurable variables—has matured across medical imaging, including the radiomics paradigm (images as high-dimensional data rather than pictures) [10]. For OCT in nAMD, this includes both conventional biomarkers (thickness and fluid presence/volume) and higher-order texture/shape descriptors that may capture subtle disease states not easily summarized by a single thickness measure [7,8,9,10].
Over the last decade, artificial intelligence (AI)—particularly machine learning (ML) and deep learning—has shown strong performance in ophthalmic imaging tasks such as diagnosis and triage from OCT and other retinal modalities [11,12]. Deep learning has also enabled automated OCT quantification (e.g., fluid detection/segmentation), which is a key prerequisite for building scalable predictive systems [13].
A natural next step beyond detection is prediction: using baseline (or near-baseline) OCT information to anticipate treatment response categories. Multiple approaches have been explored, including multimodal ML using OCT plus clinical variables, purely image-based deep networks, radiomics pipelines, generative approaches, and self-supervised representation learning [14,15,16,17,18,19]. For example, ML models have been trained to predict visual outcomes from baseline OCT and clinical variables [14], radiomic features have been investigated as predictors of response and durability [15,16], generative models have been used for agent-specific outcome prediction [17], and self-supervised OCT feature learning has been applied to classify treatment effectiveness with high discriminative performance [18]. In routine care settings, ML has also been used to predict limited early response using baseline characteristics with feature importance analyses to improve interpretability [19].
Despite rapid progress, the evidence base is fragmented across datasets, OCT acquisition protocols, outcome definitions, modeling choices, and reporting practices. Importantly, many studies emphasize threshold-independent metrics (e.g., area under the receiver operating characteristic curve (AUROC)), while clinicians often need thresholded performance that maps onto decisions: “Does this model correctly flag a likely limited responder?” That is exactly where sensitivity (true-positive rate) and specificity (true-negative rate) become clinically legible. Sensitivity is aligned with minimizing missed limited responders (false negatives), while specificity helps avoid unnecessary changes in management for eyes that would respond well (false positives). This makes sensitivity/specificity a pragmatic basis for evidence synthesis when the intended use resembles a clinical classification or triage problem rather than a purely comparative modeling benchmark.
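To make this decision framing concrete, both metrics follow directly from the four cells of a 2 × 2 confusion matrix. A minimal Python sketch with illustrative counts (not data from any included study):

```python
# Minimal sketch: thresholded classification metrics from a 2x2 table.
# Counts below are illustrative only, not from any included study.
tp, fn = 40, 10   # limited responders correctly flagged / missed
tn, fp = 35, 15   # good responders correctly cleared / falsely flagged

sensitivity = tp / (tp + fn)  # true-positive rate: catch limited responders
specificity = tn / (tn + fp)  # true-negative rate: spare good responders

print(f"sensitivity = {sensitivity:.2f}")  # 0.80
print(f"specificity = {specificity:.2f}")  # 0.70
```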
Accordingly, there is a strong rationale for a systematic review and meta-analysis focused on OCT-based AI prediction of treatment response in nAMD using sensitivity and specificity as the primary meta-analyzable endpoints. The methodological toolkit for synthesizing diagnostic/predictive accuracy (jointly pooling sensitivity and specificity while accounting for their trade-off) is well established. To maximize interpretability and reduce confounding by clinical practice variation, this review is framed around model performance as reported, without attempting to compare or adjust for follow-up schedules or specific treatment regimens (i.e., the synthesis targets predictive discrimination rather than regimen effect estimation). By pooling results across studies, we aim to extract generalizable insights into which modeling strategies and features are most effective, providing guidance for both clinicians and ML researchers on the design of future predictive models.

2. Materials and Methods

2.1. General Guideline

This systematic review evaluates prognostic prediction models that use baseline OCT-derived information to predict future anti-VEGF treatment response. We report the review following PRISMA principles, and because our primary meta-analyzable endpoints were threshold-based sensitivity and specificity (2 × 2 tables), we additionally followed PRISMA-DTA [20] items relevant to test accuracy synthesis and applied a bivariate random-effects model for joint pooling.
Methodological rigor and transparency were ensured through systematic application of the PRISMA checklist across all phases of study identification, selection, data extraction, and reporting (see Supplementary Tables S1 and S2). The review protocol was prospectively registered in the INPLASY database (registration number: INPLASY2025120086) [21]. As this study synthesized data from previously published literature and involved no direct interaction with human participants, ethical approval and informed consent were not required.

2.2. Database Searches and Identification of Eligible Manuscripts

Two reviewers (L.-W.T. and T.-W.W.) independently conducted a systematic literature search to identify studies evaluating artificial intelligence-based models for predicting treatment response in neovascular age-related macular degeneration (nAMD) following anti-vascular endothelial growth factor (anti-VEGF) therapy. Electronic searches were performed in PubMed, Embase, Web of Science, and IEEE Xplore from database inception through 18 December 2025, without language restrictions. The search strategy combined controlled vocabulary and free-text terms related to neovascular AMD, anti-VEGF agents, artificial intelligence or machine learning methods, and treatment response or outcome prediction; the complete search strategy is provided in Supplementary Table S3.
Eligible studies included prospective or retrospective cohort studies and prediction model development or validation studies enrolling patients with nAMD treated with approved anti-VEGF agents (including ranibizumab, aflibercept, bevacizumab, brolucizumab, or faricimab). We included studies that applied machine learning or deep learning models to predict post-treatment outcomes using OCT-based imaging, other retinal imaging modalities, clinical variables, or multimodal inputs and that reported model performance metrics (e.g., sensitivity, specificity, area under the curve, and accuracy) or provided sufficient data for their derivation. Analyses at either the eye level or patient level were eligible.
Studies were excluded if they focused solely on AMD detection, classification, or staging without explicit treatment response prediction; involved only dry AMD or other retinal diseases; evaluated non-AI-based statistical models without machine learning; reported only lesion-level analyses without eye- or patient-level outcomes; or lacked extractable outcome or performance data for quantitative synthesis. Case reports, small case series, reviews, editorials, letters, conference abstracts without full text, and simulation studies without real patient data were also excluded. Titles and abstracts were screened independently by both reviewers, followed by full-text review using predefined criteria, with discrepancies resolved by consensus.

2.3. Data Extraction and Management

Two reviewers (L.-W.T. and T.-W.W.) independently extracted data using a predefined, standardized extraction form. Extracted information was organized into structured domains covering study characteristics (first author, publication year, geographic region, number of centers, and study design), prediction framework characteristics (prediction horizon categorized as short term or long term, model type such as deep learning or radiomics, and target outcome defined as functional or anatomical response), and sample characteristics (analysis unit and sample size category).
For quantitative synthesis, reviewers extracted or derived confusion matrix data, including the numbers of true positives (tp), false negatives (fn), true negatives (tn), and false positives (fp) corresponding to each study’s predefined response classification threshold. When confusion matrices were not explicitly reported, these values were reconstructed from reported sensitivity, specificity, prevalence, or outcome distributions whenever possible. Each study was treated as an independent prediction model evaluation, with outcomes analyzed at the eye level or patient level as reported. To avoid within-study multiplicity, we prespecified that each study would contribute one indexed model-outcome pair. We extracted the study’s primary analysis (the authors’ main reported model for the prespecified endpoint and horizon), preferentially using performance from an independent held-out test set (or the reported final evaluation set). When multiple model variants were presented, we selected the model emphasized as primary by the authors and most comparable across studies (baseline OCT-based prediction of the stated endpoint). Each study contributed a single 2 × 2 table at the reported operating point; when the thresholding rule was not stated, we treated it as “not reported/implicit” rather than optimizing thresholds post hoc. Discrepancies in data extraction were resolved through discussion and consensus between the two reviewers. The finalized dataset was used for pooled analyses of sensitivity and specificity and for subgroup analyses according to model type, prediction target (functional vs. anatomical), prediction horizon, and study setting.
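Where only summary metrics were available, the reconstruction step described above reduces to simple arithmetic on the reported sensitivity, specificity, prevalence, and sample size. A minimal sketch, assuming all four quantities refer to the same evaluation set (the function name and example values are ours, for illustration only):

```python
def reconstruct_2x2(sens, spec, n, prevalence):
    """Rebuild tp/fn/tn/fp from reported summary metrics.

    Assumes sensitivity, specificity, and prevalence all refer to the
    same evaluation set of n eyes/patients; counts are rounded to
    the nearest integer.
    """
    n_pos = round(n * prevalence)    # eyes with the target outcome
    n_neg = n - n_pos
    tp = round(sens * n_pos)
    fn = n_pos - tp
    tn = round(spec * n_neg)
    fp = n_neg - tn
    return tp, fn, tn, fp

# Illustrative values only (not from an included study):
print(reconstruct_2x2(sens=0.79, spec=0.83, n=200, prevalence=0.35))
# -> (55, 15, 108, 22)
```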

2.4. Quality Assessment

Risk of bias was assessed independently by two reviewers (L.-W.T. and T.-W.W.) using the Prediction model Risk Of Bias Assessment Tool (PROBAST), applied within an AI-specific framework (PROBAST+AI) [22]. This assessment evaluated four key domains: (1) participants, covering data source representativeness and selection criteria; (2) predictors, focusing on the definition and timing of input variables; (3) outcome, assessing the method and blinding of the reference standard; and (4) analysis, which scrutinized sample size adequacy, overfitting, and data leakage. Studies were stratified as having low, unclear, or high risk of bias based on the highest risk identified in any single domain. Special emphasis was placed on AI-specific methodological pitfalls, including the inappropriate inclusion of bilateral eyes without statistical adjustment, small test sets relative to model complexity, lack of external validation, and implausibly high-performance metrics indicative of overfitting. Discrepancies in quality assessment were resolved through discussion and consensus between the two reviewers.

2.5. Statistical Analysis

For each included study, 2 × 2 contingency tables were constructed to obtain the numbers of true positives (tp), false positives (fp), true negatives (tn), and false negatives (fn) according to each model’s predefined threshold for classifying treatment response. When confusion matrices were not explicitly reported, these values were derived from published sensitivity, specificity, and sample size data when possible. Pooled estimates of sensitivity and specificity and their corresponding 95% confidence intervals (CIs) were calculated using a bivariate random-effects logistic regression model, which jointly models sensitivity and specificity while accounting for their correlation and between-study heterogeneity. Summary receiver operating characteristic (SROC) curves and forest plots were generated to visualize overall and study-level predictive performance.
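For readers outside the diagnostic meta-analysis literature, the following is a minimal, illustrative sketch of bivariate random-effects (Reitsma-type) pooling on the logit scale, fitted by maximum likelihood under a normal within-study approximation. It is a simplified stand-in for the exact binomial model fitted with Stata's metadta/midas in our analysis, and all counts are invented for illustration:

```python
# Simplified bivariate random-effects pooling of sensitivity/specificity.
# Normal approximation on the logit scale; NOT the exact binomial model
# used in the review. All 2x2 counts below are illustrative.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

# Each row: tp, fn, tn, fp (one illustrative table per study)
tables = np.array([
    [40, 10, 35, 15],
    [55, 15, 108, 22],
    [28, 14, 60, 40],
    [90, 7, 120, 8],
    [33, 11, 50, 30],
    [70, 20, 95, 25],
], dtype=float)

tp, fn, tn, fp = tables.T
y = np.column_stack([logit(tp / (tp + fn)), logit(tn / (tn + fp))])
# Delta-method within-study variances of the logits
s2 = np.column_stack([1 / tp + 1 / fn, 1 / tn + 1 / fp])

def neg_log_lik(params):
    mu = params[:2]              # pooled logit(sens), logit(spec)
    tau = np.exp(params[2:4])    # between-study SDs (kept positive)
    rho = np.tanh(params[4])     # between-study correlation in (-1, 1)
    nll = 0.0
    for i in range(len(y)):
        cov = np.array([[tau[0] ** 2 + s2[i, 0], rho * tau[0] * tau[1]],
                        [rho * tau[0] * tau[1], tau[1] ** 2 + s2[i, 1]]])
        diff = y[i] - mu
        nll += 0.5 * (np.log(np.linalg.det(cov))
                      + diff @ np.linalg.solve(cov, diff))
    return nll

fit = minimize(neg_log_lik, x0=np.zeros(5), method="Nelder-Mead",
               options={"maxiter": 5000})
pooled_sens, pooled_spec = expit(fit.x[:2])
print(f"pooled sensitivity ~ {pooled_sens:.2f}, "
      f"pooled specificity ~ {pooled_spec:.2f}")
```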
Prespecified subgroup analyses were conducted to explore potential sources of heterogeneity, stratified by model type (deep learning vs. radiomics), prediction target (functional vs. anatomical response), prediction horizon (short term vs. long term), study setting (single center vs. multicenter), and geographic region (Asia vs. Western countries). We additionally performed a prespecified influence check and sensitivity analyses excluding studies with extreme operating points that exerted disproportionate influence on pooled estimates. Publication bias was assessed using Deeks’ funnel plot asymmetry test [23], and robustness of pooled estimates was evaluated through sensitivity analyses with sequential study exclusion. All analyses were performed using Stata version 18.0 (StataCorp, College Station, TX, USA) with the metadta [24] and midas packages, with a two-sided p-value < 0.05 considered statistically significant.
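Deeks' asymmetry test itself is a weighted regression of the log diagnostic odds ratio on the inverse square root of the effective sample size; a slope significantly different from zero suggests small-study effects. A hedged sketch with invented counts (not the included studies' data):

```python
# Minimal sketch of Deeks' funnel-plot asymmetry test.
# Counts are illustrative, not the included studies' data.
import numpy as np
import statsmodels.api as sm

tables = np.array([  # tp, fn, tn, fp per study (illustrative)
    [40, 10, 35, 15],
    [55, 15, 108, 22],
    [28, 14, 60, 40],
    [90, 7, 120, 8],
    [33, 11, 50, 30],
    [70, 20, 95, 25],
], dtype=float)
tp, fn, tn, fp = tables.T

ln_dor = np.log((tp * tn) / (fp * fn))       # log diagnostic odds ratio
n_pos, n_neg = tp + fn, tn + fp
ess = 4 * n_pos * n_neg / (n_pos + n_neg)    # effective sample size

X = sm.add_constant(1 / np.sqrt(ess))
fit = sm.WLS(ln_dor, X, weights=ess).fit()   # weighted linear regression
print(f"Deeks' slope p-value: {fit.pvalues[1]:.3f}")
```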

3. Results

3.1. Study Identification and Selection

Our systematic literature review, depicted in the PRISMA flowchart (Figure 1), began with a search across PubMed, Embase, Web of Science, and IEEE Xplore, yielding 517 studies. After removing 121 duplicates, we screened 396 articles using EndNote, excluding 206 for insufficient relevance. Further scrutiny of 190 full-text articles led to additional exclusions for the reasons detailed in Figure 1. Ultimately, this process resulted in selecting seven studies [15,16,25,26,27,28,29] for our systematic review and meta-analysis.

3.2. Study Design, Data Sources, and Clinical Characteristics of Included Studies

Table 1 and Table 2 summarize the study design, data sources, and clinical and treatment characteristics of the seven studies included in this review evaluating artificial intelligence-based prediction of treatment response in neovascular age-related macular degeneration (AMD). The included studies were conducted across multiple high-income regions, including South Korea [25,27,28], the United States [15,16], Taiwan [26], and the United Kingdom [29]. Most studies employed retrospective single-center designs [16,25,26,27,29], reflecting the real-world nature of available clinical data, while one study represented a post hoc analysis of a prospective, multicenter randomized controlled trial (OSPREY) [15]. Data sources primarily consisted of tertiary referral eye hospitals and institutional electronic health records, with one study leveraging rigorously collected trial data.
Across studies, sample sizes varied substantially, ranging from small cohorts of fewer than 50 patients [16] to large real-world datasets exceeding 1500 eyes [29]. The study populations were predominantly elderly, with mean or median ages ranging from approximately 70 to 80 years, consistent with the epidemiology of neovascular AMD. Female representation varied across studies, from 33.2% to 60.6%, and was not uniformly reported. All studies evaluated patients receiving anti-vascular endothelial growth factor (anti-VEGF) therapy, most commonly ranibizumab and aflibercept, with some including brolucizumab and bevacizumab. Treatment protocols generally followed standard clinical practice, consisting of three monthly loading injections followed by pro re nata or treat-and-extend regimens, although protocol-driven extension was applied in the randomized trial setting. Outcome assessments varied across studies, encompassing short-term anatomical response, early recurrence after fluid resolution, and functional visual outcomes at 12 months, highlighting heterogeneity in target definitions for treatment response prediction.

3.3. Overview of Machine Learning Models and Outcome Definitions

Table 3 and Table 4 summarize the input data modalities, model architectures, validation strategies, and outcome definitions of the machine learning approaches used to predict treatment response in neovascular age-related macular degeneration (AMD). Across all included studies, spectral-domain optical coherence tomography (SD-OCT) served as the sole imaging input modality, underscoring its central role in contemporary AMD management and its suitability for data-driven modeling. Most studies relied exclusively on baseline or pre-injection OCT scans [15,16,26,27,28,29], while one study incorporated pre-treatment imaging explicitly defined relative to injection timing [25]. Models were evaluated at the eye or patient level as reported; several studies explicitly enforced one eye per patient, whereas others included bilateral eyes (more eyes than patients) without clearly reported clustering adjustments.
A range of machine learning paradigms was employed, reflecting methodological heterogeneity across studies. Convolutional neural network (CNN)-based architectures were commonly used, including DenseNet-201 backbones [25,28], as well as hybrid approaches integrating CNN-derived features with structured clinical variables or gradient boosting models [26,27]. Several studies adopted radiomics-based feature extraction from OCT images, including two-dimensional pigment epithelial detachment texture features [16] and three-dimensional OCT radiomics [15]. One study leveraged an automated commercial platform (Google Cloud AutoML Tables) combining neural networks and gradient-boosted decision trees [29]. Validation strategies varied substantially, encompassing k-fold cross-validation, leave-one-out cross-validation, repeated cross-validation runs, internal hold-out validation, and stratified train–test splits, highlighting differences in model robustness assessment and risk of overfitting.
Threshold reporting was inconsistent: many papers reported sensitivity/specificity but did not specify the operating-point selection rule, while one study explicitly reported a default probability cutoff (0.5) and explored threshold tuning (0.4). Target outcomes and class definitions also differed across studies (Table 4), reflecting the absence of a unified definition of treatment response in neovascular AMD. Anatomical outcomes were most frequently evaluated, including short-term fluid resolution, early recurrence after achieving a dry macula, and durability of anatomical response [15,16,25,28]. Functional outcomes, primarily visual acuity at 12 months, were assessed in several studies using clinically meaningful thresholds such as Snellen line improvement, logMAR cutoffs, or ETDRS letter scores [26,27,29]. Reference standards ranged from expert adjudication by experienced retinal specialists to quantitative OCT-derived fluid metrics and standardized best-corrected visual acuity measurements. This heterogeneity in outcome definitions and ground truth labeling represents a key methodological challenge for cross-study comparison and meta-analytic synthesis of AI-based treatment response prediction models.

3.4. Quality Assessment

Risk of bias and applicability were evaluated using the PROBAST+AI framework (Table 5). Overall, six of the seven included studies were rated as having a high overall risk of bias, while one study demonstrated a moderate risk profile. The analysis domain was the principal contributor to high risk, largely due to the absence of external validation, limited sample sizes relative to model complexity, and incomplete reporting of calibration, overfitting mitigation, and missing data handling. The participants domain was generally judged to be at moderate risk, reflecting the predominance of retrospective, single-center cohorts. Similarly, the outcomes domain showed moderate risk because of heterogeneity in outcome definitions and variability in reference standards across studies.
In contrast, the predictors domain was consistently rated as low risk of bias, as all models relied on baseline or pre-treatment OCT images that are routinely available at the time of treatment initiation. Applicability concerns were uniformly low, indicating that study populations, predictors, and clinical contexts were broadly representative of real-world anti-VEGF treatment decision making in neovascular age-related macular degeneration.

3.5. Overall Meta-Analysis of Predictive Accuracy of OCT-Based Machine Learning Models

Han et al. (2024) [25] showed an extreme operating point (sensitivity 0.97, specificity 0.01), far outside the range of other anatomical-outcome studies (specificity ~0.40–0.80). This pattern indicates a near “always positive” classification threshold and produced disproportionate influence on pooled specificity. We therefore present primary pooled estimates excluding this outlier and provide sensitivity analyses including it (Supplementary Figure S2). The bivariate random-effects model (six studies) estimated a pooled sensitivity of 0.79 (95% CI 0.68–0.87) and pooled specificity of 0.83 (95% CI 0.62–0.94) for OCT-based machine learning prediction of anti-VEGF treatment outcomes in neovascular AMD. There was substantial between-study heterogeneity (generalized I2 = 74.36%; sensitivity I2 = 67.14%; specificity I2 = 84.91%) with moderate positive correlation between sensitivity and specificity across studies (ρ = 0.52). A likelihood-ratio test strongly favored the random-effects specification over a fixed-effects model (χ2 = 144.42; df = 3; p < 0.001), supporting meaningful variability in accuracy across studies. Study-level estimates ranged from 0.67 to 0.93 for sensitivity and 0.40 to 0.98 for specificity, with particularly low specificity in Jung et al.’s study (0.40) and comparatively high paired performance in Yeh et al.’s (sensitivity 0.93; specificity 0.94) and Kim et al.’s (specificity 0.98) studies. These results, including the corresponding forest plots, are presented in Figure 2, while the summary receiver operating characteristic (SROC) plot is depicted in Figure 3. No significant publication bias was observed for the included studies (p = 0.12, Supplementary Figure S1).

3.6. Subgroup Analysis

We conducted subgroup analyses to explore sources of heterogeneity (Table 6). Geographic region (Asian vs. Western studies) did not show a significant difference in performance (p = 0.42 for sensitivity; p = 0.43 for specificity), nor did study setting (single center vs. multicenter, p = 0.49 and 0.12). This suggests that, broadly, model accuracy was not clearly dependent on region or single vs. multicenter data source, although only one multicenter study was included.
Model feature type (radiomics vs. deep learning) similarly showed no significant differences (sensitivity ~0.80 in both; specificity 0.72 vs. 0.87, p = 0.42 for specificity). This indicates that radiomics-based approaches performed comparably to deep learning models on average, a noteworthy finding given debates about handcrafted features versus deep features. It implies that no approach had an inherent advantage across these studies, although sample sizes were small (N = 2 radiomics studies).
In contrast, prediction horizon and outcome type appeared to influence specificity. Models predicting long-term outcomes (typically 12-month results) achieved much higher pooled specificity (0.92, 95% CI = 0.83–0.96) than those predicting short-term outcomes (0.48, 95% CI = 0.21–0.76), with a statistically significant difference (p = 0.01). Sensitivity did not differ as much by horizon (short-term 0.73 vs. long-term 0.82, p = 0.40). Similarly, models predicting functional outcomes (visual acuity) had significantly higher specificity (0.94) than those predicting strictly anatomical outcomes (0.61, p = 0.02), while sensitivities were in a closer range (0.82 vs. 0.76, p = 0.48). These patterns suggest that short-term anatomical predictions (like early fluid recurrence) may be prone to high false-positive rates, whereas longer-term or functional predictions yielded cleaner separation of responders vs. non-responders.
Finally, studies with a larger sample size (>500 eyes) had higher specificity (0.94) than smaller studies (0.61, p = 0.02), although sensitivity again was similar (0.82 vs. 0.76). This may reflect that larger datasets allow models to be more finely tuned to avoid false alarms or that smaller studies reported more optimistic sensitivity at the expense of specificity. It underscores the value of adequate sample size in developing clinically precise models.

4. Discussion

4.1. Principal Findings

In this systematic review and meta-analysis, OCT-based AI/ML models demonstrated a pooled sensitivity of 0.79 and specificity of 0.83 for predicting treatment response in neovascular AMD, although with considerable heterogeneity across studies. Clinically, this suggests that current models have moderate-to-high discriminative ability to differentiate likely good responders from poor responders. Such performance might be useful for decision support (e.g., identifying high-risk eyes that may need closer monitoring or prioritization for further assessment), but it is not yet at a level suitable for standalone decision-making. Notably, these discrimination metrics do not directly establish clinical utility: without prespecified clinical pathways or decision-analytic evaluation (e.g., net benefit/decision curves), we cannot quantify whether using predictions would improve outcomes or reduce unnecessary treatment changes. The substantial variability in accuracy between studies indicates that real-world performance will depend heavily on local factors like patient population and how “response” is defined. Importantly, even the best-performing models still have a trade-off: for instance, a sensitivity around 0.8 means about 20% of suboptimal responders might be missed, and a specificity around 0.8 means about 20% of eyes that would respond well could be falsely flagged as non-responders (potentially prompting unnecessary treatment changes if acted upon). These are promising results but also a reminder that the models are far from perfect.
From an ML perspective, one key finding is that no single type of modeling approach categorically outperformed others. Radiomics-based feature models achieved similar accuracy to deep learning models in our analysis. This implies that the features predictive of anti-VEGF response can be captured by different means—whether via handcrafted imaging features or automatically learned deep features—and that success may hinge more on data quality and study design than on algorithm novelty. Furthermore, the higher specificity observed in long-term functional outcome predictions suggests that predictive signals in baseline OCT may translate better to broad, downstream outcomes (like final visual acuity) than to short-term anatomical fluctuations that can be noisy and protocol dependent.

4.2. Machine Learning and Knowledge Extraction

A useful lens to interpret these results is to consider what “knowledge” ML models extract from OCT images. OCT features known to correlate with outcomes—particularly fluid compartments, lesion morphology, and outer retinal integrity—have been widely studied as prognostic biomarkers in nAMD [7,8,9]. Radiomics formalizes this knowledge by converting OCT into quantitative descriptors (texture, shape, and intensity), aligning with the paradigm that images are data [10,15,16]. Deep learning, by contrast, learns internal representations directly from pixel data and has performed strongly in ophthalmic imaging tasks [11,12]; it can also automate detection of relevant features (e.g., fluid) [13], potentially standardizing inputs. In our meta-analysis, radiomics and deep learning approaches showed no significant difference in average performance, suggesting that the underlying signal linking baseline OCT patterns to treatment outcomes can be captured by both paradigms.
Importantly, apparent performance is often driven as much by validation design as by model type. Several included studies relied on internal random splits or non-nested cross-validation, which can yield optimistic estimates when feature selection/hyperparameters are effectively tuned on the same data used for evaluation. In addition, data-leakage risks—such as image-level splitting when multiple scans from the same eye/patient exist or unclear handling of bilateral eyes—can inflate discrimination by allowing correlated information to appear across training and test sets. By contrast, external or temporal validation (training on earlier data and testing on later cohorts or different centers/devices) was uncommon, limiting confidence in transportability across real-world settings. These ML-specific design choices plausibly contribute to both the heterogeneity we observed and the gap between proof-of-concept performance and deployment-ready reliability. Therefore, rather than algorithm novelty alone, factors such as outcome definition, representativeness of the dataset, leakage control, and rigor of validation are likely key determinants of downstream utility [22,30].
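As one concrete safeguard against the leakage pattern described above, data can be split at the patient level rather than at the scan or eye level. A minimal sketch using scikit-learn's GroupKFold on synthetic data (all variables and values are placeholders, not from any included study):

```python
# Minimal sketch: patient-level (grouped) cross-validation so that scans
# or fellow eyes from one patient never appear on both sides of a split.
# All data here are synthetic placeholders.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_scans = 120
X = rng.normal(size=(n_scans, 16))              # stand-in OCT-derived features
y = rng.integers(0, 2, size=n_scans)            # stand-in response labels
patient_id = rng.integers(0, 40, size=n_scans)  # several scans/eyes per patient

cv = GroupKFold(n_splits=5)
for fold, (tr, te) in enumerate(cv.split(X, y, groups=patient_id)):
    # No patient appears in both the training and test fold
    assert set(patient_id[tr]).isdisjoint(patient_id[te])
    print(f"fold {fold}: {len(tr)} train scans, {len(te)} test scans")
```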

4.3. Why Sensitivity/Specificity Matters for Clinical Use

Many AI studies in ophthalmology report AUROC as a summary of performance, but moving toward clinical applicability requires fixing decision thresholds and evaluating sensitivity and specificity (or related metrics like predictive values). In the context of nAMD treatment, a model with high sensitivity would be valuable to catch nearly all poor responders early (avoiding undertreatment), whereas a model with high specificity would help avoid falsely labeling good responders as failures (preventing overtreatment or unnecessary switches). The pooled sensitivity (~0.79) and specificity (~0.83) we found indicate a reasonably good balance suitable for a triage or supportive role. For example, a clinic might use a sensitive model to flag patients at high risk of poor response so they can be monitored more closely or considered for adjunctive therapies; concurrently, the specificity of ~0.83 implies that about 17% of eyes that would respond well could still be flagged as “at-risk”, a false-positive rate that is manageable in a monitoring context. Our decision to meta-analyze threshold-based metrics was driven by this need to interpret model performance in a decision-making frame. By contrast, if we only looked at AUROCs (which several included studies reported in the 0.70–0.85 range), we might conclude that the models are “pretty good”, but that would obscure how they might actually be used. Framing the results in terms of sensitivity/specificity makes the findings more actionable for clinicians who might integrate AI predictions into their treatment decisions.
However, threshold-based metrics are inherently operating-point dependent, and the included studies often used different or incompletely reported thresholding strategies (e.g., implicit default cutoffs, data-driven optimization, or unspecified operating points). Therefore, the pooled sensitivity and specificity should be interpreted as a summary of performance at the reported operating points across studies rather than performance at a harmonized clinical threshold. Future work should prespecify clinically meaningful operating points and report threshold selection and calibration more transparently. Moreover, even well-calibrated sensitivity/specificity does not, by itself, demonstrate benefit in practice. Decision-analytic evaluation (e.g., decision curves/net benefit) is needed to translate predicted risk into actionable thresholds and explicit management pathways (e.g., intensified monitoring vs. early switch vs. durability-guided dosing) and to quantify the trade-off between missed poor responders and unnecessary escalation. Because such analyses were rarely reported in the primary studies, our findings should be interpreted as evidence of discrimination rather than proven clinical utility.
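To illustrate how such a decision-analytic evaluation works, net benefit at a threshold probability p_t weighs true positives against false positives at the exchange rate p_t/(1 − p_t). A minimal sketch using the pooled operating point and an assumed (not observed) prevalence of limited responders:

```python
# Minimal sketch of the net-benefit calculation behind a decision curve,
# using the pooled operating point (sens 0.79, spec 0.83) and an ASSUMED
# prevalence of limited responders; these are not study-level data.
def net_benefit(sens, spec, prevalence, pt):
    """Net benefit of acting on model flags at threshold probability pt."""
    tp_rate = sens * prevalence                # correctly flagged limited responders
    fp_rate = (1 - spec) * (1 - prevalence)    # falsely flagged good responders
    return tp_rate - fp_rate * pt / (1 - pt)

prev = 0.30  # assumed prevalence for illustration
for pt in (0.10, 0.20, 0.30):
    nb_model = net_benefit(0.79, 0.83, prev, pt)
    nb_all = prev - (1 - prev) * pt / (1 - pt)  # "flag everyone" comparator
    print(f"pt={pt:.2f}: model NB={nb_model:.3f}, flag-all NB={nb_all:.3f}")
```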

4.4. Interpreting Heterogeneity and Moderator Patterns

We observed that heterogeneity in model performance was partly structured: models predicting short-term anatomical outcomes (like fluid status after three injections) showed much lower specificity on average than those predicting 12-month outcomes. One interpretation is that short-term OCT changes are influenced by numerous factors including transient physiological variation and strictness of grading criteria, making it harder for models to distinguish noise from true non-response. In contrast, a 12-month outcome (especially a functional one like final vision) integrates the effect of the entire treatment course and may be less sensitive to any single OCT visit’s quirks. Thus, models targeting longer-term endpoints might inherently achieve cleaner separation between responders and non-responders (hence higher specificity). Another finding was that larger sample sizes were associated with higher specificity. This could be due to better training of models to recognize truly negative cases or possibly due to publication bias (smaller negative studies might not be published), though our funnel plot did not show clear asymmetry.
The lack of difference in sensitivity or specificity between Asian and Western studies, and between single vs. multicenter studies, is somewhat reassuring in terms of generalizability: it suggests that the concept of using OCT to predict treatment outcome is not confined to one healthcare system or ethnicity. However, given the limited number of studies, this should be interpreted cautiously.
Given that only six studies entered the meta-analysis, these subgroup analyses are exploratory, and we treated the findings as hypothesis-generating. For example, the idea that functional outcome prediction yields higher specificity needs confirmation in future meta-analyses when more studies are available. It might be confounded by other factors (indeed, the functional outcome studies also tended to be larger and more recent). Multivariable meta-regression was not feasible with so few studies, but as the field grows, it would be valuable to disentangle whether horizon, outcome type, sample size, or region independently affect performance. For now, our results hint that the prediction of long-term, functionally relevant outcomes might be the “sweet spot” for AI in nAMD, an insight that could inform what future studies prioritize.

4.5. Risk of Bias, Generalizability, and Reporting Gaps

The fact that almost all included studies were rated at high risk of bias (using PROBAST+AI) highlights a general issue in the current literature: many studies are in a proof-of-concept phase rather than a deployment-ready phase. Common limitations were lack of external validation (models often tested only on randomly split internal data), potential overfitting (complex models on small datasets), and incomplete reporting (e.g., not stating how missing OCT scans were handled or whether readers were masked to outcomes when grading OCTs). These issues can inflate reported performance. For instance, a model might appear highly sensitive in a single-center test but fail to generalize to another center’s OCT machine or patient demographics. This phenomenon has been noted across AI in healthcare, where initial excitement is tempered by later real-world evaluation. Indeed, our meta-analysis found substantial heterogeneity, which could partly stem from some models overfitting to their narrow training context.
To advance, the community should adopt established reporting standards for prediction model studies. Using checklists like TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) [31] and STARD (Standards for Reporting Diagnostic Accuracy) [32] can improve clarity on how studies were conducted, making it easier to judge bias. Additionally, AI-specific extensions such as CONSORT-AI for trials [33] and SPIRIT-AI for protocols [34] will become relevant once these predictive models move into prospective trials. In our review, only one study was a trial-based analysis [15], but we anticipate more prospective studies or implementations in the near future.
Another aspect is model interpretability and clinical integration. None of the studies focused on explainable AI techniques, yet for an ophthalmologist to trust and use a prediction, it helps to know why a model is saying a patient is high risk (e.g., “because there is subretinal fibrous tissue on the OCT that often correlates with poor response”). Incorporating explainability or at least performing feature importance analyses (as done in one study [19]) can mitigate the “black box” concern. Recent MAKE research has explored methods to improve interpretability of deep models in medical imaging [35], which could be applied here to increase clinician trust in AI predictions.
Finally, many studies had narrow definitions of response (like anatomy after three injections), which might not align with what patients and clinicians ultimately care about (vision and long-term disease control). We need consensus on outcome definitions for “AI-ready” endpoints in nAMD. Efforts by groups like the AMD Outcomes Initiative or the Macular Society could be valuable in standardizing what constitutes a responder, a partial responder, etc., in the context of routine care data. This would make it easier to pool and compare studies and to train models on larger combined datasets.

4.6. Implications and Next Steps

Our findings have several implications for both clinical practice and ML research in ophthalmology. First, the promise shown by these models suggests that in the near future, ophthalmologists could have access to an OCT-based tool that, at baseline, flags patients at high risk of needing more intensive therapy. This might influence how aggressively to treat (e.g., consider early switch from one anti-VEGF agent to another or the addition of steroids, etc., in a patient predicted to respond poorly to standard therapy). However, given the heterogeneity and bias risks, no model is ready for prime time yet. Before deployment, we need prospective validation in multicenter, multi-device studies. Ideally, an adaptive trial could test whether having AI predictions improves patient outcomes or resource allocation.
For the ML community, our review highlights the need to focus on generalizability and robustness. This might involve training models on more diverse datasets (e.g., using federated learning across hospitals), performing extensive external validation, and incorporating techniques to handle domain shifts (like when a new OCT device is used). It also underscores that bigger data is not the only answer: careful study design and avoiding information leakage are equally critical. Many of the high-bias studies could have been improved by methods like nested cross-validation for model selection or using an independent test set from a later time.
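For instance, nested cross-validation keeps hyperparameter tuning inside an inner loop so that the outer performance estimate is not biased by model selection. A minimal scikit-learn sketch on synthetic stand-in data (model and grid are illustrative choices, not those of any included study):

```python
# Minimal sketch of nested cross-validation: hyperparameters are tuned
# only on inner folds, so outer-fold scores are not optimistically biased
# by model selection. Synthetic data stand in for OCT-derived features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

inner = KFold(n_splits=3, shuffle=True, random_state=0)
outer = KFold(n_splits=5, shuffle=True, random_state=1)

model = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # tuned on inner folds only
    cv=inner,
)
scores = cross_val_score(model, X, y, cv=outer, scoring="roc_auc")
print(f"nested CV AUROC: {scores.mean():.2f} +/- {scores.std():.2f}")
```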
Another future direction is to combine imaging with other data (genetic markers and systemic factors) to see if predictions improve. Some known predictors of anti-VEGF response include baseline visual acuity and certain risk alleles; integrating these with OCT might yield more powerful models. Also, as new therapies (e.g., longer-acting biologics or gene therapy) emerge, the definition of “response” may evolve (for instance, durability might become an even more crucial outcome). AI models will need to adapt to these changes and be revalidated.
Lastly, the relatively good performance of both radiomics and deep learning suggests a possible synergy: radiomics features could be used to inform or cross-check deep models (“hybrid” models). This might improve interpretability as well, since radiomics features are easier to relate to clinical concepts (e.g., “texture feature X corresponds to subretinal fibrosis”). Approaches that generate human-understandable features as intermediate outputs (like quantifying all relevant lesion features via segmentation and then predicting outcome) could strike a balance between deep learning’s power and radiomics’ transparency.
In summary, our meta-analysis provides a benchmark of current AI capabilities in predicting nAMD treatment outcomes and offers insights to guide future research. We emphasize that achieving clinical impact will require not just incremental model accuracy gains but also addressing bias, improving interpretability, and rigorously demonstrating generalizability. From a machine learning standpoint, the comparable performance of deep and feature-based models indicates that algorithm choice may be less critical than ensuring robust study design, adequate sample size, and standardized endpoints. Future AI development in ophthalmology should therefore prioritize generalizability, interpretability, and bias mitigation to effectively translate these predictive tools into practice.

5. Conclusions

In conclusion, this systematic review and meta-analysis demonstrates that OCT-based AI models can predict anti-VEGF treatment response in neovascular AMD with pooled sensitivity of ~0.79 and specificity of ~0.83. This represents moderate discriminative performance, indicating potential utility in clinical decision support to stratify responders versus non-responders. However, we observed substantial heterogeneity between studies and pervasive risk of bias, highlighting the need for prospective, multicenter validation of these models before routine clinical adoption. From an ML perspective, our findings suggest no clear winner between deep learning and radiomics approaches—both achieved similar accuracy—implying that data quality and study rigor are more influential on performance than the specific algorithm. Going forward, collaborative efforts should focus on external validation, standardized outcome definitions, and incorporation of explainability and bias mitigation strategies. With these steps, AI-driven predictive modeling holds promise to enable more personalized and efficient treatment of nAMD in the future.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/make8010023/s1: Table S1. PRISMA-DTA abstract checklist; Table S2. PRISMA-DTA checklist; Table S3. Keywords and search results for different databases; Figure S1. Deeks' test plot; Figure S2. Sensitivity analysis.

Author Contributions

Conceptualization, W.-T.L. and T.-W.W.; methodology, W.-T.L. and T.-W.W.; software, W.-T.L. and T.-W.W.; validation, W.-T.L. and T.-W.W.; formal analysis, W.-T.L. and T.-W.W.; investigation, W.-T.L. and T.-W.W.; resources, W.-T.L. and T.-W.W.; data curation, W.-T.L. and T.-W.W.; writing—original draft preparation, W.-T.L. and T.-W.W.; writing—review and editing, W.-T.L. and T.-W.W.; visualization, W.-T.L. and T.-W.W.; supervision, T.-W.W.; project administration, T.-W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data available on request due to privacy and ethical restrictions.

Acknowledgments

ChatGPT (GPT-5.2, OpenAI) was used solely to improve language readability. The authors take full responsibility for the final content.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AMD: Age-related macular degeneration
nAMD: Neovascular age-related macular degeneration
AI: Artificial intelligence
ML: Machine learning
DL: Deep learning
OCT: Optical coherence tomography
SD-OCT: Spectral-domain optical coherence tomography
anti-VEGF: Anti-vascular endothelial growth factor
IRF: Intraretinal fluid
SRF: Subretinal fluid
SRH: Subretinal hemorrhage
PED: Pigment epithelial detachment
VA: Visual acuity
ETDRS: Early Treatment Diabetic Retinopathy Study
logMAR: Logarithm of the minimum angle of resolution
AUC: Area under the receiver operating characteristic curve
SROC: Summary receiver operating characteristic
CI: Confidence interval
PRISMA-DTA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy
PROBAST: Prediction model Risk of Bias Assessment Tool
PROBAST+AI: Prediction model Risk of Bias Assessment Tool for Artificial Intelligence
INPLASY: International Platform of Registered Systematic Review and Meta-Analysis Protocols
CNN: Convolutional neural network
LOOCV: Leave-one-out cross-validation
PRN: Pro re nata

References

  1. Wong, W.L.; Su, X.; Li, X.; Cheung, C.M.G.; Klein, R.; Cheng, C.Y.; Wong, T.Y. Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: A systematic review and meta-analysis. Lancet Glob. Health 2014, 2, e106–e116. [Google Scholar] [CrossRef]
  2. Ferris, F.L., III; Wilkinson, C.P.; Bird, A.; Chakravarthy, U.; Chew, E.; Csaky, K.; Sadda, S.R. Clinical classification of age-related macular degeneration. Ophthalmology 2013, 120, 844–851. [Google Scholar] [CrossRef]
  3. Rosenfeld, P.J.; Brown, D.M.; Heier, J.S.; Boyer, D.S.; Kaiser, P.; Chung, C.Y.; Kim, R.Y. Ranibizumab for neovascular age-related macular degeneration. N. Engl. J. Med. 2006, 355, 1419–1431. [Google Scholar] [CrossRef]
  4. CATT Research Group; Martin, D.F.; Maguire, M.G.; Ying, G.-S.; Grunwald, J.E.; Fine, S.L.; Jaffe, G.J. Ranibizumab and bevacizumab for neovascular age-related macular degeneration. N. Engl. J. Med. 2011, 364, 1897–1908. [Google Scholar] [CrossRef] [PubMed]
  5. Heier, J.S.; Brown, D.M.; Chong, V.; Korobelnik, J.-F.; Kaiser, P.K.; Nguyen, Q.D.; Kirchhof, B.; Ho, A.; Ogura, Y.; Yancopoulos, G.D.; et al. Intravitreal aflibercept (VEGF Trap-Eye) in wet age-related macular degeneration. Ophthalmology 2012, 119, 2537–2548. [Google Scholar] [CrossRef] [PubMed]
  6. Spaide, R.F.; Jaffe, G.J.; Sarraf, D.; Freund, K.B.; Sadda, S.R.; Staurenghi, G.; Waheed, N.K.; Chakravarthy, U.; Rosenfeld, P.J.; Holz, F.G.; et al. Consensus Nomenclature for Reporting Neovascular Age-Related Macular Degeneration Data: Consensus on Neovascular Age-Related Macular Degeneration Nomenclature Study Group. Ophthalmology 2020, 127, 616–636. [Google Scholar] [CrossRef] [PubMed]
  7. Metrangolo, C.; Donati, S.; Mazzola, M.; Fontanel, L.; Messina, W.; D’Alterio, G.; Rubino, M.; Radice, P.; Premi, E.; Azzolini, C. OCT Biomarkers in Neovascular Age-Related Macular Degeneration: A Narrative Review. J. Ophthalmol. 2021, 2021, 9994098. [Google Scholar] [CrossRef]
  8. Chaudhary, V.; Matonti, F.; Zarranz-Ventura, J.; Stewart, M.W. Impact of fluid compartments on functional outcomes for patients with neovascular age-related macular degeneration: A systematic literature review. Retina 2022, 42, 589–606. [Google Scholar] [CrossRef]
  9. Lai, T.T.; Hsieh, Y.T.; Yang, C.M.; Ho, T.-C.; Yang, C.-H. Biomarkers of optical coherence tomography in evaluating the treatment outcomes of neovascular age-related macular degeneration: A real-world study. Sci. Rep. 2019, 9, 529. [Google Scholar] [CrossRef]
  10. Gillies, R.J.; Kinahan, P.E.; Hricak, H. Radiomics: Images Are More than Pictures, They Are Data. Radiology 2016, 278, 563–577. [Google Scholar] [CrossRef]
  11. Kermany, D.S.; Goldbaum, M.; Cai, W.; Valentim, C.C.S.; Liang, H.; Baxter, S.L.; McKeown, A.; Yang, G.; Wu, X.; Yan, F.; et al. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell 2018, 172, 1122–1131.e9. [Google Scholar] [CrossRef]
  12. De Fauw, J.; Ledsam, J.R.; Romera-Paredes, B.; Nikolov, S.; Tomasev, N.; Blackwell, S.; Askham, H.; Glorot, X.; O’donoghue, B.; Visentin, D.; et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 2018, 24, 1342–1350. [Google Scholar] [CrossRef] [PubMed]
  13. Schlegl, T.; Waldstein, S.M.; Bogunovic, H.; Endstraßer, F.; Sadeghipour, A.; Philip, A.-M.; Podkowinski, D.; Gerendas, B.S.; Langs, G.; Schmidt-Erfurth, U. Fully Automated Detection and Quantification of Macular Fluid in OCT Using Deep Learning. Ophthalmology 2018, 125, 549–558. [Google Scholar] [CrossRef]
  14. Maunz, A.; Barras, L.; Kawczynski, M.G.; Dai, J.; Lee, A.Y.; Spaide, R.F.; Sahni, J.; Ferrara, D. Machine Learning to Predict Response to Ranibizumab in Neovascular Age-Related Macular Degeneration. Ophthalmol. Sci. 2023, 3, 100319. [Google Scholar] [CrossRef]
  15. Kar, S.S.; Cetin, H.; Lunasco, L.; Le, T.K.; Zahid, R.; Meng, X.; Srivastava, S.K.; Madabhushi, A.; Ehlers, J.P. OCT-Derived Radiomic Features Predict Anti-VEGF Response and Durability in Neovascular Age-Related Macular Degeneration. Ophthalmol. Sci. 2022, 2, 100171. [Google Scholar] [CrossRef]
  16. Williamson, R.C.; Selvam, A.; Sant, V.; Patel, M.; Bollepalli, S.C.; Vupparaboina, K.K.; Sahel, J.A.; Chhablani, J. Radiomics-Based Prediction of Anti-VEGF Treatment Response in Neovascular Age-Related Macular Degeneration with Pigment Epithelial Detachment. Transl. Vis. Sci. Technol. 2023, 12, 3. [Google Scholar] [CrossRef] [PubMed]
  17. Moon, J.W.; Lee, Y.; Hwang, J.; Kim, C.G.; Kim, J.W.; Yoon, W.T.; Kim, J.H. Prediction of anti-vascular endothelial growth factor agent-specific treatment outcomes in neovascular age-related macular degeneration using a generative adversarial network. Sci. Rep. 2023, 13, 5546. [Google Scholar] [CrossRef]
  18. Feng, D.; Chen, X.; Wang, X.; Mou, X.; Bai, L.; Zhang, S.; Zhou, Z. Predicting effectiveness of anti-VEGF injection through self-supervised learning in OCT images. Math. Biosci. Eng. 2023, 20, 2439–2458. [Google Scholar] [CrossRef] [PubMed]
  19. Perkins, S.W.; Wu, A.K.; Singh, R.P. Predictors of limited early response to anti-vascular endothelial growth factor therapy in neovascular age-related macular degeneration with machine learning feature importance. Saudi J. Ophthalmol. 2022, 36, 315–321.
  20. McInnes, M.D.F.; Moher, D.; Thombs, B.D.; McGrath, T.A.; Bossuyt, P.M.; Clifford, T.; Cohen, J.F.; Deeks, J.J.; Gatsonis, C.; Hooft, L.; et al. Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies: The PRISMA-DTA Statement. JAMA 2018, 319, 388–396.
  21. Luo, W.; Wang, T. Artificial Intelligence Predicting Treatment Response in Neovascular Age Macular Degeneration with Anti-VEGF: A Systematic Review and Meta-Analysis. 2025. Available online: https://inplasy.com/inplasy-2025-12-0086/ (accessed on 26 December 2025).
  22. Moons, K.G.M.; Damen, J.A.A.; Kaul, T.; Hooft, L.; Navarro, C.A.; Dhiman, P.; Beam, A.L.; Van Calster, B.; Celi, L.A.; Denaxas, S.; et al. PROBAST+AI: An updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods. BMJ 2025, 388, e082505.
  23. Deeks, J.J.; Macaskill, P.; Irwig, L. The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. J. Clin. Epidemiol. 2005, 58, 882–893.
  24. Nyaga, V.N.; Arbyn, M. Metadta: A Stata command for meta-analysis and meta-regression of diagnostic test accuracy data—A tutorial. Arch. Public Health 2022, 80, 95; Erratum in Arch. Public Health 2022, 80, 216. https://doi.org/10.1186/s13690-021-00747-5.
  25. Han, J.M.; Han, J.; Ko, J.; Jung, J.; Park, J.I.; Hwang, J.S.; Yoon, J.; Jung, J.H.; Hwang, D.D. Anti-VEGF treatment outcome prediction based on optical coherence tomography images in neovascular age-related macular degeneration using a deep neural network. Sci. Rep. 2024, 14, 28253.
  26. Yeh, T.C.; Luo, A.C.; Deng, Y.S.; Lee, Y.H.; Chen, S.J.; Chang, P.H.; Lin, C.J.; Tai, M.C.; Chou, Y.B. Prediction of treatment outcome in neovascular age-related macular degeneration using a novel convolutional neural network. Sci. Rep. 2022, 12, 5871.
  27. Kim, N.; Lee, M.; Chung, H.; Kim, H.C.; Lee, H. Prediction of Post-Treatment Visual Acuity in Age-Related Macular Degeneration Patients with an Interpretable Machine Learning Method. Transl. Vis. Sci. Technol. 2024, 13, 3.
  28. Jung, J.; Han, J.; Han, J.M.; Ko, J.; Yoon, J.; Hwang, J.S.; Park, J.I.; Hwang, G.; Jung, J.H.; Hwang, D.D. Prediction of neovascular age-related macular degeneration recurrence using optical coherence tomography images with a deep neural network. Sci. Rep. 2024, 14, 5854.
  29. Abbas, A.; O’Byrne, C.; Fu, D.J.; Moraes, G.; Balaskas, K.; Struyven, R.; Beqiri, S.; Wagner, S.K.; Korot, E.; Keane, P.A. Evaluating an automated machine learning model that predicts visual acuity outcomes in patients with neovascular age-related macular degeneration. Graefe’s Arch. Clin. Exp. Ophthalmol. 2022, 260, 2461–2473.
  30. Ferrara, D.; Newton, E.M.; Lee, A.Y. Artificial intelligence-based predictions in neovascular age-related macular degeneration. Curr. Opin. Ophthalmol. 2021, 32, 389–396.
  31. Collins, G.S.; Reitsma, J.B.; Altman, D.G.; Moons, K.G.M. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. Ann. Intern. Med. 2015, 162, 55–63.
  32. Bossuyt, P.M.; Reitsma, J.B.; Bruns, D.E.; Gatsonis, C.A.; Glasziou, P.P.; Irwig, L.; Lijmer, J.G.; Moher, D.; Rennie, D.; de Vet, H.C.W.; et al. STARD 2015: An Updated List of Essential Items for Reporting Diagnostic Accuracy Studies. BMJ 2015, 351, h5527.
  33. Liu, X.; Rivera, S.C.; Moher, D.; Calvert, M.J.; Denniston, A.K. CONSORT-AI extension: Reporting guidelines for clinical trials evaluating artificial intelligence interventions. BMJ 2020, 370, m3164.
  34. Liu, X.; Rivera, S.C.; Moher, D.; Calvert, M.J.; Denniston, A.K. SPIRIT-AI extension: Guidance for clinical trial protocols for interventions involving artificial intelligence. BMJ 2020, 370, m3210.
  35. Ennab, M.; Mcheick, H. Advancing AI interpretability in medical imaging: A comparative analysis of pixel-level interpretability and Grad-CAM models. Mach. Learn. Knowl. Extr. 2025, 7, 12.
Figure 1. PRISMA flowchart of included studies.
Figure 2. Forest plot showing the sensitivity and specificity of artificial intelligence algorithms for predicting treatment response in neovascular age macular degeneration with anti-VEGF. Studies included: Williamson et al. (2024) [16], Sil Kar et al. (2022) [15], Yeh et al. (2022) [26], Kim et al. (2024) [27], Jung et al. (2024) [28], and Abbas et al. (2022) [29]. Squares indicate study point estimates and horizontal lines indicate 95% CIs. The red diamond indicates the pooled estimate (width = 95% CI), and the red dashed vertical line marks the pooled point estimate.
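For readers who wish to see how pooled estimates of this kind arise, the sketch below implements a simplified univariate DerSimonian–Laird random-effects pooling of logit-transformed sensitivities in Python. It is an illustration only: this review used a bivariate random-effects model (fitted with metadta [24]), and the 2×2 counts in the sketch are hypothetical, not the counts of the included studies.

```python
# Simplified DerSimonian-Laird random-effects pooling of sensitivities
# on the logit scale. The (TP, FN) counts below are hypothetical -- NOT
# the counts of the included studies; the review itself fitted a
# bivariate random-effects model.
import numpy as np
from scipy.special import expit, logit

tp = np.array([40, 25, 60, 30, 50, 45], dtype=float)  # hypothetical true positives
fn = np.array([10, 8, 15, 9, 12, 11], dtype=float)    # hypothetical false negatives

sens = tp / (tp + fn)
y = logit(sens)              # per-study logit sensitivity
v = 1.0 / tp + 1.0 / fn      # approximate variance of a logit-transformed proportion

# Fixed-effect weights, Cochran's Q, and the method-of-moments tau^2.
w = 1.0 / v
y_fe = np.sum(w * y) / np.sum(w)
q = np.sum(w * (y - y_fe) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - (len(y) - 1)) / c)

# Random-effects weights, pooled estimate, and 95% CI on the logit scale.
w_re = 1.0 / (v + tau2)
y_re = np.sum(w_re * y) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
lo, hi = y_re - 1.96 * se_re, y_re + 1.96 * se_re

print(f"Pooled sensitivity: {expit(y_re):.2f} "
      f"(95% CI {expit(lo):.2f}-{expit(hi):.2f}), tau^2 = {tau2:.3f}")
```

Specificities would be pooled analogously from (TN, FP) counts; the bivariate model used in this review additionally accounts for the correlation between sensitivity and specificity across studies, which a univariate sketch like this one ignores.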
Figure 3. Summary receiver operating characteristic (SROC) curve showing the accuracy of artificial intelligence algorithms for predicting treatment response in neovascular age macular degeneration with anti-VEGF. Dashed ellipses indicate the 95% confidence region (summary estimate) and 95% prediction region (expected range for a new study).
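As a companion to Figure 3, the minimal matplotlib sketch below places study operating points and the pooled summary point in ROC space. It does not reproduce the model-based SROC curve or its confidence and prediction ellipses; the pooled point uses the estimates reported in this review (sensitivity 0.79, specificity 0.83), while the individual study points are hypothetical.

```python
# Minimal ROC-space plot: per-study operating points plus the pooled
# summary point. NOT the model-based SROC curve with confidence and
# prediction ellipses in Figure 3; study points are hypothetical.
import matplotlib.pyplot as plt

study_sens = [0.75, 0.79, 0.82, 0.73, 0.80, 0.76]   # hypothetical
study_spec = [0.76, 0.89, 0.92, 0.48, 0.72, 0.61]   # hypothetical
pooled_sens, pooled_spec = 0.79, 0.83               # pooled estimates from this review

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter([1 - s for s in study_spec], study_sens, label="Individual studies")
ax.scatter([1 - pooled_spec], [pooled_sens], marker="D", s=80, color="red",
           label="Pooled summary point")
ax.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance diagonal
ax.set_xlabel("1 - Specificity")
ax.set_ylabel("Sensitivity")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.legend(loc="lower right")
plt.tight_layout()
plt.show()
```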
Table 1. Study design and data sources of included studies.

| Author | Country | Design | Centers | Data Source | Disease |
|---|---|---|---|---|---|
| Han et al. (2024) [25] | South Korea | Retrospective | Single center | Kong Eye Hospital | Neovascular AMD |
| Williamson et al. (2024) [16] | USA | Retrospective | Single center | University of Pittsburgh | Neovascular AMD with PED |
| Yeh et al. (2022) [26] | Taiwan | Retrospective | Single center | Taipei Veterans General Hospital | Typical neovascular AMD |
| Sil Kar et al. (2022) [15] | USA | Post hoc analysis of prospective RCT | Multicenter | OSPREY trial | Neovascular AMD |
| Kim et al. (2024) [27] | South Korea | Retrospective | Single center | Konkuk University Medical Center | Neovascular AMD |
| Jung et al. (2024) [28] | South Korea | Retrospective | Multicenter | Kong Eye Hospital and affiliated centers | Neovascular AMD |
| Abbas et al. (2022) [29] | UK | Retrospective | Single center | Moorfields Eye Hospital electronic health records | Neovascular AMD |
Table 2. Clinical and treatment characteristics of included studies.

| Author | Sample Size | Mean Age | Female | Anti-VEGF | Regimen | Outcome Assessment |
|---|---|---|---|---|---|---|
| Han et al. (2024) [25] | 517 patients (517 eyes) | 71.4 ± 9.0 years | 40% | Ranibizumab 71%; aflibercept 29% | 3 monthly loading injections | 1 month after 3rd injection |
| Williamson et al. (2024) [16] | 39 patients (39 eyes) | NR | NR | Mixed agents | 3 monthly loading injections followed by as-needed therapy | Anatomical recurrence at 6 months |
| Yeh et al. (2022) [26] | 698 patients (698 eyes) | 78.47 ± 9.88 years | 33.24% | Aflibercept 77.94%; ranibizumab 22.06% | 3 monthly loading injections + PRN | Outcome assessed at 12 months |
| Sil Kar et al. (2022) [15] | 81 patients (81 eyes) | Super responders: 81 ± 8; non-super responders: 77 ± 10 | Super responders: 73%; non-super responders: 61% | Brolucizumab 6 mg; aflibercept 2 mg | 3 monthly loading injections; protocol-driven extension (trial setting) | Complete IRF + SRF resolution by week 16; maintained throughout 56 weeks |
| Kim et al. (2024) [27] | 527 eyes (506 patients) | 73.74 ± 8.83 years | 39.66% | Aflibercept, ranibizumab, brolucizumab, and bevacizumab | 3 monthly loading injections; treat and extend or PRN | 12-month visual outcome |
| Jung et al. (2024) [28] | 269 patients (269 eyes; 1076 OCT images) | 70.7 ± 8.84 years | 42.00% | Ranibizumab and aflibercept | 3 monthly loading injections | Recurrence within 3 months after confirming dry-up 1 month post-3rd injection |
| Abbas et al. (2022) [29] | 1631 eyes (1547 patients) | Median 80 years (IQR 73–85) | 60.60% | Aflibercept and ranibizumab | 3 loading injections; treat and extend (or PRN) | 12 months after treatment initiation |

Abbreviations: NR, not reported.
Table 3. Machine learning model inputs, architectures, and validation strategies.

| Author | Input Modality | Timepoints Used | Architecture | Validation | Analysis Unit |
|---|---|---|---|---|---|
| Han et al. (2024) [25] | SD-OCT | Pre-injection | DenseNet-201 backbone | 10-fold cross-validation | Patient/eye level (1 eye per patient) |
| Williamson et al. (2024) [16] | SD-OCT | Baseline only | Radiomics (52 PED texture features) | 3-fold hierarchical cross-validation | Patient/eye level (1 eye per patient) |
| Yeh et al. (2022) [26] | SD-OCT | Baseline only | Heterogeneous Data Fusion Net (HDF-Net; CNN + numeric data fusion) | Internal hold-out cross-validation | Patient/eye level (1 eye per patient) |
| Sil Kar et al. (2022) [15] | SD-OCT | Baseline only | 3D OCT radiomics | 3-fold cross-validation (500 runs) | Patient/eye level (1 eye per patient) |
| Kim et al. (2024) [27] | SD-OCT | Baseline only | CNN-derived OCT probability + XGBoost (RawM1) | Train/validation/test split | Eye level (bilateral possible) |
| Jung et al. (2024) [28] | SD-OCT | Baseline only | DenseNet-201 CNN | LOOCV (K-fold sensitivity analysis reported) | Patient/eye level (1 eye per patient) |
| Abbas et al. (2022) [29] | SD-OCT | Baseline only | Google Cloud AutoML Tables (ensemble of NN + GBDT) | Stratified 85:15 train–test split | Eye level (bilateral possible) |
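Two of the included models were built on a DenseNet-201 backbone [25,28]. The sketch below shows a generic transfer-learning setup of that kind in PyTorch; it is a minimal illustration under assumed preprocessing and hyperparameters, not the code of any included study.

```python
# Generic transfer-learning setup for binary treatment-response
# classification from OCT B-scans, loosely mirroring the DenseNet-201
# backbones in Table 3. Illustrative sketch only -- the included
# studies' actual preprocessing and hyperparameters are not reproduced.
import torch
import torch.nn as nn
from torchvision import models

def build_model(num_classes: int = 2) -> nn.Module:
    # ImageNet-pretrained DenseNet-201; replace the classifier head
    # with a new linear layer for the binary prediction task.
    model = models.densenet201(weights=models.DenseNet201_Weights.DEFAULT)
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    return model

model = build_model()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One hypothetical training step on a dummy batch. Grayscale OCT scans
# would first be replicated to 3 channels and resized to 224x224.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```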
Table 4. Target outcomes, class definitions, and reference standards used for training and evaluating machine learning models.

| Author | Target Outcome | Threshold | Positive Class | Negative Class | Reference Standard |
|---|---|---|---|---|---|
| Han et al. (2024) [25] | Anatomical response | Not reported | Good responder = dry macula | Persistent IRF or SRF | 2 retinal specialists (15+ years of experience) |
| Williamson et al. (2024) [16] | Anatomical recurrence | Not reported | Responder | Recurring fluid at 6 months | OCT-based clinical assessment |
| Yeh et al. (2022) [26] | Functional visual outcome at 12 months | Not reported | VA improvement ≥ 2 Snellen lines | VA improvement < 2 lines | Best-corrected visual acuity at 12 months |
| Sil Kar et al. (2022) [15] | Anatomical response durability | Not reported | Complete IRF + SRF resolution by week 16, maintained throughout 56 weeks | Non-super responder | Quantitative OCT fluid volumes |
| Kim et al. (2024) [27] | Functional visual outcome at 12 months | Not reported | Poor VA (logMAR ≥ 1.0) | Good VA (logMAR < 1.0) | Best-corrected visual acuity at 12 months |
| Jung et al. (2024) [28] | Anatomical recurrence | Not reported | Recurrence ≤ 3 months | No recurrence ≤ 3 months | OCT evidence of IRF/SRF/SRH or increased PED |
| Abbas et al. (2022) [29] | Visual acuity outcome at 12 months | 0.5; 0.4 | VA ≥ 70 ETDRS letters (“Above”) | VA < 70 ETDRS letters (“Below”) | ETDRS best-corrected VA at 12 months |
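Because every study in Table 4 dichotomizes its outcome, threshold-based performance reduces to a 2×2 table. The following sketch computes sensitivity and specificity with Wilson 95% confidence intervals from hypothetical counts; the helper function wilson_ci is ours for illustration, not taken from any included study.

```python
# Sensitivity and specificity from a threshold-based 2x2 table, with
# Wilson 95% confidence intervals. Counts are hypothetical, illustrating
# the threshold-based metrics pooled in this review.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # Wilson score interval for a binomial proportion.
    p = successes / n
    center = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

tp, fn, tn, fp = 40, 10, 50, 12   # hypothetical counts

sens = tp / (tp + fn)
spec = tn / (tn + fp)
se_lo, se_hi = wilson_ci(tp, tp + fn)
sp_lo, sp_hi = wilson_ci(tn, tn + fp)
print(f"Sensitivity {sens:.2f} (95% CI {se_lo:.2f}-{se_hi:.2f})")
print(f"Specificity {spec:.2f} (95% CI {sp_lo:.2f}-{sp_hi:.2f})")
```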
Table 5. PROBAST+AI assessment of risk of bias and applicability of included studies.

| Study | Participants (RoB) | Predictors (RoB) | Outcomes (RoB) | Analysis (RoB) | Overall Risk of Bias | Applicability Concerns |
|---|---|---|---|---|---|---|
| Han et al. (2024) [25] | Moderate | Low | Moderate | High | High | Low |
| Williamson et al. (2024) [16] | Moderate | Low | Moderate | High | High | Low |
| Yeh et al. (2022) [26] | Moderate | Low | Moderate | High | High | Low |
| Sil Kar et al. (2022) [15] | Low | Low | Moderate | Moderate | Moderate | Low |
| Kim et al. (2024) [27] | Moderate | Low | Moderate | High | High | Low |
| Jung et al. (2024) [28] | Moderate | Low | Moderate | High | High | Low |
| Abbas et al. (2022) [29] | Moderate | Low | Moderate | High | High | Low |
Table 6. Moderator analysis of predictive performance of artificial intelligence algorithms.

| Moderator | Category | N | Sensitivity (95% CI) | p-Value | Specificity (95% CI) | p-Value |
|---|---|---|---|---|---|---|
| Country | Asian | 3 | 0.75 (0.57–0.87) | 0.42 | 0.76 (0.41–0.94) | 0.43 |
| | Western | 3 | 0.79 (0.68–0.86) | | 0.89 (0.63–0.97) | |
| Center | Single | 4 | 0.81 (0.70–0.89) | 0.49 | 0.90 (0.75–0.97) | 0.12 |
| | Multiple | 2 | 0.79 (0.69–0.87) | | 0.84 (0.68–0.93) | |
| Treatment horizon | Short term | 2 | 0.73 (0.51–0.87) | 0.40 | 0.48 (0.21–0.76) | 0.01 * |
| | Long term | 4 | 0.82 (0.70–0.89) | | 0.92 (0.83–0.96) | |
| Feature type | Radiomics | 2 | 0.80 (0.56–0.93) | 0.95 | 0.72 (0.27–0.95) | 0.42 |
| | Deep learning | 4 | 0.79 (0.67–0.88) | | 0.87 (0.65–0.96) | |
| Outcome | Anatomical | 3 | 0.76 (0.57–0.88) | 0.48 | 0.61 (0.36–0.82) | 0.02 * |
| | Functional | 3 | 0.82 (0.70–0.90) | | 0.94 (0.85–0.97) | |
| Size | <500 | 3 | 0.76 (0.57–0.88) | 0.48 | 0.61 (0.36–0.82) | 0.02 * |
| | >500 | 3 | 0.82 (0.70–0.90) | | 0.94 (0.85–0.97) | |
* p < 0.05 for between-subgroup difference (test of moderator effect).
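A common approximation to the moderator tests in Table 6 compares subgroup estimates on the logit scale with a z-test, backing out standard errors from the reported confidence intervals. The sketch below applies this to the short- vs. long-term specificity rows; it is a simplified approximation and is not expected to reproduce the exact p-values from the meta-regression used here (metadta [24]).

```python
# Approximate between-subgroup comparison of pooled specificities on
# the logit scale, using the Table 6 estimates for short- vs. long-term
# treatment horizon. A simplified z-test sketch, not the meta-regression
# actually fitted in this review.
from math import log, sqrt
from scipy.stats import norm

def logit(p: float) -> float:
    return log(p / (1 - p))

def se_from_ci(lo: float, hi: float) -> float:
    # Back out the logit-scale standard error from a 95% CI.
    return (logit(hi) - logit(lo)) / (2 * 1.96)

# Specificity (point estimate, CI lower, CI upper) from Table 6.
short = (0.48, 0.21, 0.76)   # short-term horizon
long_ = (0.92, 0.83, 0.96)   # long-term horizon

diff = logit(long_[0]) - logit(short[0])
se = sqrt(se_from_ci(short[1], short[2]) ** 2 +
          se_from_ci(long_[1], long_[2]) ** 2)
z = diff / se
p = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```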