1. Introduction
Age-related macular degeneration (AMD) is a leading cause of irreversible vision loss worldwide, and the neovascular (“wet”) subtype (nAMD) accounts for most cases of severe, rapid visual deterioration. The global prevalence and projected burden of AMD continue to rise with population aging, making nAMD a persistent public health and clinical workload challenge [1,2]. Anti-vascular endothelial growth factor (anti-VEGF) therapy has transformed nAMD outcomes and remains the cornerstone of modern management, with landmark trials demonstrating clinically meaningful visual benefit in many patients [3,4,5]. Yet, treatment response is heterogeneous: some eyes achieve robust anatomical improvement and visual stabilization, while others show incomplete or variable response despite receiving therapy. This variability matters because early identification of likely responders vs. limited responders could support more personalized care pathways, reduce avoidable clinic burden, and help clinicians counsel patients with greater precision [3,4,5].
Optical coherence tomography (OCT) is central to nAMD assessment because it provides high-resolution cross-sectional visualization of the macula and enables quantification of disease activity through imaging phenotypes (for example, intraretinal fluid (IRF), subretinal fluid (SRF), and pigment epithelial detachment (PED) patterns). Standardized OCT terminology is particularly important for aggregating evidence across studies and devices, and consensus nomenclature has helped harmonize how neovascular AMD features are reported [6]. In parallel, a substantial literature describes OCT-based biomarkers and their prognostic implications, including reviews and systematic syntheses evaluating how fluid compartments and structural changes relate to functional outcomes [7,8,9].
The technical opportunity is that OCT is not only “readable” by humans; it is also machine readable at scale. Feature engineering and knowledge extraction from imaging—turning patterns into measurable variables—has matured across medical imaging, including the radiomics paradigm (images as high-dimensional data rather than pictures) [10]. For OCT in nAMD, this includes both conventional biomarkers (thickness and fluid presence/volume) and higher-order texture/shape descriptors that may capture subtle disease states not easily summarized by a single thickness measure [7,8,9,10].
Over the last decade, artificial intelligence (AI)—particularly machine learning (ML) and deep learning—has shown strong performance in ophthalmic imaging tasks such as diagnosis and triage from OCT and other retinal modalities [11,12]. Deep learning has also enabled automated OCT quantification (e.g., fluid detection/segmentation), which is a key prerequisite for building scalable predictive systems [13].
A natural next step beyond detection is prediction: using baseline (or near-baseline) OCT information to anticipate treatment response categories. Multiple approaches have been explored, including multimodal ML using OCT plus clinical variables, purely image-based deep networks, radiomics pipelines, generative approaches, and self-supervised representation learning [14,15,16,17,18,19]. For example, ML models have been trained to predict visual outcomes from baseline OCT and clinical variables [14], radiomic features have been investigated as predictors of response and durability [15,16], generative models have been used for agent-specific outcome prediction [17], and self-supervised OCT feature learning has been applied to classify treatment effectiveness with high discriminative performance [18]. In routine care settings, ML has also been used to predict limited early response from baseline characteristics, with feature importance analyses to improve interpretability [19].
Despite rapid progress, the evidence base is fragmented across datasets, OCT acquisition protocols, outcome definitions, modeling choices, and reporting practices. Importantly, many studies emphasize threshold-independent metrics (e.g., area under the receiver operating characteristic curve (AUROC)), while clinicians often need thresholded performance that maps onto decisions: “Does this model correctly flag a likely limited responder?” That is exactly where sensitivity (true-positive rate) and specificity (true-negative rate) become clinically legible. Sensitivity is aligned with minimizing missed limited responders (false negatives), while specificity helps avoid unnecessary changes in management for eyes that would respond well (false positives). This makes sensitivity/specificity a pragmatic basis for evidence synthesis when the intended use resembles a clinical classification or triage problem rather than a purely comparative modeling benchmark.
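For reference, with TP, FN, TN, and FP denoting true positives, false negatives, true negatives, and false positives in a model’s 2 × 2 classification table, these two quantities are defined as Sensitivity = TP/(TP + FN) and Specificity = TN/(TN + FP).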
Accordingly, there is a strong rationale for a systematic review and meta-analysis focused on OCT-based AI prediction of treatment response in nAMD using sensitivity and specificity as the primary meta-analyzable endpoints. The methodological toolkit for synthesizing diagnostic/predictive accuracy (jointly pooling sensitivity and specificity while accounting for their trade-off) is well established. To maximize interpretability and reduce confounding by clinical practice variation, this review is framed around model performance as reported, without attempting to compare or adjust for follow-up schedules or specific treatment regimens (i.e., the synthesis targets predictive discrimination rather than regimen effect estimation). By pooling results across studies, we aim to extract generalizable insights into which modeling strategies and features are most effective, providing guidance for both clinicians and ML researchers on the design of future predictive models.
2. Materials and Methods
2.1. General Guideline
This systematic review evaluates prognostic prediction models that use baseline OCT-derived information to predict future anti-VEGF treatment response. We report the review following PRISMA principles, and because our primary meta-analyzable endpoints were threshold-based sensitivity and specificity (2 × 2 tables), we additionally followed PRISMA-DTA [20] items relevant to test accuracy synthesis and applied a bivariate random-effects model for joint pooling.
Methodological rigor and transparency were ensured through systematic application of the PRISMA checklist across all phases of study identification, selection, data extraction, and reporting (see Supplementary Tables S1 and S2). The review protocol was prospectively registered in the INPLASY database (registration number: INPLASY2025120086) [21]. As this study synthesized data from previously published literature and involved no direct interaction with human participants, ethical approval and informed consent were not required.
2.2. Database Searches and Identification of Eligible Manuscripts
Two reviewers (L.-W.T. and T.-W.W.) independently conducted a systematic literature search to identify studies evaluating artificial intelligence-based models for predicting treatment response in neovascular age-related macular degeneration (nAMD) following anti-vascular endothelial growth factor (anti-VEGF) therapy. Electronic searches were performed in PubMed, Embase, Web of Science, and IEEE Xplore from database inception through 18 December 2025, without language restrictions. The search strategy combined controlled vocabulary and free-text terms related to neovascular AMD, anti-VEGF agents, artificial intelligence or machine learning methods, and treatment response or outcome prediction; the complete search strategy is provided in Supplementary Table S3.
Eligible studies included prospective or retrospective cohort studies and prediction model development or validation studies enrolling patients with nAMD treated with approved anti-VEGF agents (including ranibizumab, aflibercept, bevacizumab, brolucizumab, or faricimab). We included studies that applied machine learning or deep learning models to predict post-treatment outcomes using OCT-based imaging, other retinal imaging modalities, clinical variables, or multimodal inputs and that reported model performance metrics (e.g., sensitivity, specificity, area under the curve, and accuracy) or provided sufficient data for their derivation. Analyses at either the eye level or patient level were eligible.
Studies were excluded if they focused solely on AMD detection, classification, or staging without explicit treatment response prediction; involved only dry AMD or other retinal diseases; evaluated non-AI-based statistical models without machine learning; reported only lesion-level analyses without eye- or patient-level outcomes; or lacked extractable outcome or performance data for quantitative synthesis. Case reports, small case series, reviews, editorials, letters, conference abstracts without full text, and simulation studies without real patient data were also excluded. Titles and abstracts were screened independently by both reviewers, followed by full-text review using predefined criteria, with discrepancies resolved by consensus.
2.3. Data Extraction and Management
Two reviewers (L.-W.T. and T.-W.W.) independently extracted data using a predefined, standardized extraction form. Extracted information was organized into structured domains covering study characteristics (first author, publication year, geographic region, number of centers, and study design), prediction framework characteristics (prediction horizon categorized as short term or long term, model type such as deep learning or radiomics, and target outcome defined as functional or anatomical response), and sample characteristics (analysis unit and sample size category).
For quantitative synthesis, reviewers extracted or derived confusion matrix data, including the numbers of true positives (tp), false negatives (fn), true negatives (tn), and false positives (fp) corresponding to each study’s predefined response classification threshold. When confusion matrices were not explicitly reported, these values were reconstructed from reported sensitivity, specificity, prevalence, or outcome distributions whenever possible. Each study was treated as an independent prediction model evaluation, with outcomes analyzed at the eye level or patient level as reported. To avoid within-study multiplicity, we prespecified that each study would contribute one indexed model-outcome pair. We extracted the study’s primary analysis (the authors’ main reported model for the prespecified endpoint and horizon), preferentially using performance from an independent held-out test set (or the reported final evaluation set). When multiple model variants were presented, we selected the model emphasized as primary by the authors and most comparable across studies (baseline OCT-based prediction of the stated endpoint). Each study contributed a single 2 × 2 table at the reported operating point; when the thresholding rule was not stated, we treated it as “not reported/implicit” rather than optimizing thresholds post hoc. Discrepancies in data extraction were resolved through discussion and consensus between the two reviewers. The finalized dataset was used for pooled analyses of sensitivity and specificity and for subgroup analyses according to model type, prediction target (functional vs. anatomical), prediction horizon, and study setting.
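As an illustration of this reconstruction step, the following Python sketch back-calculates a 2 × 2 table from reported sensitivity, specificity, and the numbers of positive and negative cases; the input values are hypothetical, and rounding mirrors the limited precision of published metrics:

# Minimal sketch (hypothetical numbers): reconstruct a 2 x 2 table from
# reported sensitivity, specificity, and class counts, as described above.

def reconstruct_2x2(sensitivity: float, specificity: float,
                    n_positive: int, n_negative: int):
    """Back-calculate tp, fn, tn, fp from reported summary metrics.

    n_positive / n_negative are the numbers of true limited responders /
    good responders (or vice versa, depending on the study's positive class).
    Counts are rounded to the nearest integer, reflecting the precision
    lost when sensitivity/specificity are reported to two decimal places.
    """
    tp = round(sensitivity * n_positive)
    fn = n_positive - tp
    tn = round(specificity * n_negative)
    fp = n_negative - tn
    return tp, fn, tn, fp


if __name__ == "__main__":
    # Hypothetical example: sensitivity 0.79, specificity 0.83,
    # 60 limited responders and 140 good responders in the test set.
    tp, fn, tn, fp = reconstruct_2x2(0.79, 0.83, 60, 140)
    print(f"tp={tp}, fn={fn}, tn={tn}, fp={fp}")

Because of rounding, tables reconstructed this way can differ by one or two counts from the true values, which is one reason extraction discrepancies were resolved by consensus.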
2.4. Quality Assessment
Risk of bias was assessed independently by two reviewers (L.-W.T. and T.-W.W.) using the Prediction model Risk Of Bias Assessment Tool (PROBAST), applied within an AI-specific framework (PROBAST-AI) [22]. This assessment evaluated four key domains: (1) participants, covering data source representativeness and selection criteria; (2) predictors, focusing on the definition and timing of input variables; (3) outcome, assessing the method and blinding of the reference standard; and (4) analysis, which scrutinized sample size adequacy, overfitting, and data leakage. Studies were stratified as having low, unclear, or high risk of bias based on the highest risk identified in any single domain. Special emphasis was placed on AI-specific methodological pitfalls, including the inappropriate inclusion of bilateral eyes without statistical adjustment, small test sets relative to model complexity, lack of external validation, and implausibly high-performance metrics indicative of overfitting. Discrepancies in quality assessment were resolved through discussion and consensus between the two reviewers.
2.5. Statistical Analysis
For each included study, 2 × 2 contingency tables were constructed to obtain the numbers of true positives (tp), false positives (fp), true negatives (tn), and false negatives (fn) according to each model’s predefined threshold for classifying treatment response. When confusion matrices were not explicitly reported, these values were derived from published sensitivity, specificity, and sample size data when possible. Pooled estimates of sensitivity and specificity and their corresponding 95% confidence intervals (CIs) were calculated using a bivariate random-effects logistic regression model, which jointly models sensitivity and specificity while accounting for their correlation and between-study heterogeneity. Summary receiver operating characteristic (SROC) curves and forest plots were generated to visualize overall and study-level predictive performance.
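To make the pooling step concrete, the following sketch fits a simplified normal-approximation (Reitsma-type) bivariate random-effects model on logit-transformed sensitivity and specificity using scipy; it is a conceptual illustration with placeholder study counts, not the exact binomial bivariate logistic model fitted with Stata’s metadta and midas packages:

# Conceptual sketch of bivariate random-effects pooling of sensitivity and
# specificity (normal approximation on the logit scale). Study 2x2 counts
# below are placeholders, not extracted data.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

# columns: tp, fn, tn, fp (one row per study; illustrative values only)
counts = np.array([
    [47, 13, 116, 24],
    [30, 10,  40, 20],
    [55, 20,  90, 15],
    [25,  5,  60, 30],
    [70, 25, 200, 40],
    [18,  6,  35, 10],
], dtype=float) + 0.5  # simple continuity correction

tp, fn, tn, fp = counts.T
y = np.column_stack([logit(tp / (tp + fn)), logit(tn / (tn + fp))])  # logit sens, logit spec
s2 = np.column_stack([1 / tp + 1 / fn, 1 / tn + 1 / fp])             # within-study variances


def neg_log_lik(theta):
    mu = theta[:2]
    tau = np.exp(theta[2:4])          # between-study SDs (kept positive)
    rho = np.tanh(theta[4])           # between-study correlation in (-1, 1)
    T = np.array([[tau[0] ** 2, rho * tau[0] * tau[1]],
                  [rho * tau[0] * tau[1], tau[1] ** 2]])
    nll = 0.0
    for yi, s2i in zip(y, s2):
        V = T + np.diag(s2i)          # total (within + between) covariance
        r = yi - mu
        nll += 0.5 * (np.log(np.linalg.det(V)) + r @ np.linalg.solve(V, r))
    return nll


fit = minimize(neg_log_lik, x0=np.array([1.0, 1.0, -1.0, -1.0, 0.0]), method="Nelder-Mead")
pooled_sens, pooled_spec = expit(fit.x[:2])
print(f"pooled sensitivity ~ {pooled_sens:.2f}, pooled specificity ~ {pooled_spec:.2f}")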
Prespecified subgroup analyses were conducted to explore potential sources of heterogeneity, stratified by model type (deep learning vs. radiomics), prediction target (functional vs. anatomical response), prediction horizon (short term vs. long term), study setting (single center vs. multicenter), and geographic region (Asia vs. Western countries). We additionally performed a prespecified influence check and sensitivity analyses excluding studies with extreme operating points that exerted disproportionate influence on pooled estimates. Publication bias was assessed using Deeks’ funnel plot asymmetry test [23], and robustness of pooled estimates was evaluated through sensitivity analyses with sequential study exclusion. All analyses were performed using Stata version 18.0 (StataCorp, College Station, TX, USA) with the metadta [24] and midas packages, with a two-sided p-value < 0.05 considered statistically significant.
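For reference, Deeks’ test amounts to a weighted regression of the log diagnostic odds ratio on the inverse square root of the effective sample size, with the slope’s p-value indicating funnel asymmetry; the sketch below is a schematic Python illustration with placeholder counts and assumes statsmodels is available, whereas the actual test was run in Stata:

# Schematic Deeks' funnel plot asymmetry test (placeholder counts), assuming
# statsmodels is available; the analysis itself used Stata's midas package.
import numpy as np
import statsmodels.api as sm

# columns: tp, fn, tn, fp (illustrative values only)
counts = np.array([
    [47, 13, 116, 24],
    [30, 10,  40, 20],
    [55, 20,  90, 15],
    [25,  5,  60, 30],
    [70, 25, 200, 40],
    [18,  6,  35, 10],
], dtype=float) + 0.5  # continuity correction

tp, fn, tn, fp = counts.T
ln_dor = np.log((tp * tn) / (fp * fn))              # log diagnostic odds ratio
n_pos, n_neg = tp + fn, tn + fp
ess = 4 * n_pos * n_neg / (n_pos + n_neg)           # effective sample size

X = sm.add_constant(1 / np.sqrt(ess))               # regress lnDOR on 1/sqrt(ESS)
model = sm.WLS(ln_dor, X, weights=ess).fit()
print(f"slope p-value = {model.pvalues[1]:.3f}")    # a small p-value suggests asymmetry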
3. Results
3.1. Study Identification and Selection
Our systematic literature review, depicted in the PRISMA flowchart (Figure 1), began with a search across PubMed, Embase, Web of Science, and IEEE Xplore, yielding 517 studies. After removing 121 duplicates, we screened 396 articles using EndNote, excluding 206 for insufficient relevance. Further scrutiny of 190 full-text articles led to additional exclusions against the predefined eligibility criteria. Ultimately, this process resulted in the selection of seven studies [15,16,25,26,27,28,29] for our systematic review and meta-analysis.
3.2. Study Design, Data Sources, and Clinical Characteristics of Included Studies
Table 1 and Table 2 summarize the study design, data sources, and clinical and treatment characteristics of the seven studies included in this review evaluating artificial intelligence-based prediction of treatment response in neovascular age-related macular degeneration (AMD). The included studies were conducted across multiple high-income regions, including South Korea [25,27,28], the United States [15,16], Taiwan [26], and the United Kingdom [29]. Most studies employed retrospective single-center designs [16,25,26,27,29], reflecting the real-world nature of available clinical data, while one study represented a post hoc analysis of a prospective, multicenter randomized controlled trial (OSPREY) [15]. Data sources primarily consisted of tertiary referral eye hospitals and institutional electronic health records, with one study leveraging rigorously collected trial data.
Across studies, sample sizes varied substantially, ranging from small cohorts of fewer than 50 patients [16] to large real-world datasets exceeding 1500 eyes [29]. The study populations were predominantly elderly, with mean or median ages ranging from approximately 70 to 80 years, consistent with the epidemiology of neovascular AMD. Female representation varied across studies, from 33.2% to 60.6%, and was not uniformly reported. All studies evaluated patients receiving anti-vascular endothelial growth factor (anti-VEGF) therapy, most commonly ranibizumab and aflibercept, with some including brolucizumab and bevacizumab. Treatment protocols generally followed standard clinical practice, consisting of three monthly loading injections followed by pro re nata or treat-and-extend regimens, although protocol-driven extension was applied in the randomized trial setting. Outcome assessments varied across studies, encompassing short-term anatomical response, early recurrence after fluid resolution, and functional visual outcomes at 12 months, highlighting heterogeneity in target definitions for treatment response prediction.
3.3. Overview of Machine Learning Models and Outcome Definitions
Table 3 and Table 4 summarize the input data modalities, model architectures, validation strategies, and outcome definitions of the machine learning approaches used to predict treatment response in neovascular age-related macular degeneration (AMD). Across all included studies, spectral-domain optical coherence tomography (SD-OCT) served as the sole imaging input modality, underscoring its central role in contemporary AMD management and its suitability for data-driven modeling. Most studies relied exclusively on baseline or pre-injection OCT scans [15,16,26,27,28,29], while one study incorporated pre-treatment imaging explicitly defined relative to injection timing [25]. Models were evaluated at the eye or patient level as reported; several studies explicitly enforced one eye per patient, whereas others included bilateral eyes (more eyes than patients) without clearly reported clustering adjustments.
A range of machine learning paradigms was employed, reflecting methodological heterogeneity across studies. Convolutional neural network (CNN)-based architectures were commonly used, including DenseNet-201 backbones [25,28], as well as hybrid approaches integrating CNN-derived features with structured clinical variables or gradient boosting models [26,27]. Several studies adopted radiomics-based feature extraction from OCT images, including two-dimensional pigment epithelial detachment texture features [16] and three-dimensional OCT radiomics [15]. One study leveraged an automated commercial platform (Google Cloud AutoML Tables) combining neural networks and gradient-boosted decision trees [29]. Validation strategies varied substantially, encompassing k-fold cross-validation, leave-one-out cross-validation, repeated cross-validation runs, internal hold-out validation, and stratified train–test splits, highlighting differences in model robustness assessment and risk of overfitting.
Threshold reporting was inconsistent: many studies reported sensitivity/specificity but did not specify the operating-point selection rule, while one study explicitly reported a default probability cutoff (0.5) and explored threshold tuning (0.4). Target outcomes and class definitions also differed across studies (Table 4), reflecting the absence of a unified definition of treatment response in neovascular AMD. Anatomical outcomes were most frequently evaluated, including short-term fluid resolution, early recurrence after achieving a dry macula, and durability of anatomical response [15,16,25,28]. Functional outcomes, primarily visual acuity at 12 months, were assessed in several studies using clinically meaningful thresholds such as Snellen line improvement, logMAR cutoffs, or ETDRS letter scores [26,27,29]. Reference standards ranged from expert adjudication by experienced retinal specialists to quantitative OCT-derived fluid metrics and standardized best-corrected visual acuity measurements. This heterogeneity in outcome definitions and ground truth labeling represents a key methodological challenge for cross-study comparison and meta-analytic synthesis of AI-based treatment response prediction models.
3.4. Quality Assessment
Risk of bias and applicability were evaluated using the PROBAST-AI framework (Table 5). Overall, six of the seven included studies were rated as having a high overall risk of bias, while one study demonstrated a moderate risk profile. The analysis domain was the principal contributor to high risk, largely due to the absence of external validation, limited sample sizes relative to model complexity, and incomplete reporting of calibration, overfitting mitigation, and missing data handling. The participants domain was generally judged to be at moderate risk, reflecting the predominance of retrospective, single-center cohorts. Similarly, the outcome domain showed moderate risk because of heterogeneity in outcome definitions and variability in reference standards across studies.
In contrast, the predictors domain was consistently rated as low risk of bias, as all models relied on baseline or pre-treatment OCT images that are routinely available at the time of treatment initiation. Applicability concerns were uniformly low, indicating that study populations, predictors, and clinical contexts were broadly representative of real-world anti-VEGF treatment decision making in neovascular age-related macular degeneration.
3.5. Overall Meta-Analysis of Predictive Accuracy of OCT-Based Machine Learning Models
Han et al. (2024) [25] showed an extreme operating point (sensitivity = 0.97, specificity = 0.01), far outside the range of other anatomical outcome studies (specificity ~0.40–0.80). This pattern indicates a near “always positive” classification threshold and produced disproportionate influence on pooled specificity. We therefore present primary pooled estimates excluding this outlier and provide sensitivity analyses including it (Supplementary Figure S2). The bivariate random-effects model (six studies) estimated a pooled sensitivity of 0.79 (95% CI 0.68–0.87) and a pooled specificity of 0.83 (95% CI 0.62–0.94) for OCT-based machine learning prediction of anti-VEGF treatment outcomes in neovascular AMD. There was substantial between-study heterogeneity (generalized I² = 74.36%; sensitivity I² = 67.14%; specificity I² = 84.91%), with moderate positive correlation between sensitivity and specificity across studies (ρ = 0.52). A likelihood-ratio test strongly favored the random-effects specification over a fixed-effects model (χ² = 144.42; df = 3; p < 0.001), supporting meaningful variability in accuracy across studies. Study-level estimates ranged from 0.67 to 0.93 for sensitivity and from 0.40 to 0.98 for specificity, with particularly low specificity in Jung et al.’s study (0.40) and comparatively high paired performance in Yeh et al.’s (sensitivity 0.93; specificity 0.94) and Kim et al.’s (specificity 0.98) studies. These results, including the corresponding forest plots, are presented in Figure 2, while the summary receiver operating characteristic (SROC) plot is depicted in Figure 3. No significant publication bias was observed among the included studies (p = 0.12; Supplementary Figure S1).
3.6. Subgroup Analysis
We conducted subgroup analyses to explore sources of heterogeneity (Table 6). Geographic region (Asian vs. Western studies) did not show a significant difference in performance (p = 0.42 for sensitivity; p = 0.43 for specificity), nor did study setting (single center vs. multicenter, p = 0.49 and 0.12). This suggests that, broadly, model accuracy was not clearly dependent on region or single vs. multicenter data source, although only one multicenter study was included.
Model feature type (radiomics vs. deep learning) similarly showed no significant differences (sensitivity ~0.80 in both groups; specificity 0.72 vs. 0.87, p = 0.42). This indicates that radiomics-based approaches performed comparably to deep learning models on average, a noteworthy finding given ongoing debates about handcrafted versus learned deep features. It implies that neither approach had an inherent advantage across these studies, although the comparison is limited by small numbers (only two radiomics studies).
In contrast, prediction horizon and outcome type appeared to influence specificity. Models predicting long-term outcomes (typically 12-month results) achieved much higher pooled specificity (0.92, 95% CI = 0.83–0.96) than those predicting short-term outcomes (0.48, 95% CI = 0.21–0.76), with a statistically significant difference (p = 0.01). Sensitivity did not differ as much by horizon (short-term 0.73 vs. long-term 0.82, p = 0.40). Similarly, models predicting functional outcomes (visual acuity) had significantly higher specificity (0.94) than those predicting strictly anatomical outcomes (0.61, p = 0.02), while sensitivities were in a closer range (0.82 vs. 0.76, p = 0.48). These patterns suggest that short-term anatomical predictions (like early fluid recurrence) may be prone to high false-positive rates, whereas longer-term or functional predictions yielded cleaner separation of responders vs. non-responders.
Finally, studies with a larger sample size (>500 eyes) had higher specificity (0.94) than smaller studies (0.61, p = 0.02), although sensitivity again was similar (0.82 vs. 0.76). This may reflect that larger datasets allow models to be more finely tuned to avoid false alarms or that smaller studies reported more optimistic sensitivity at the expense of specificity. It underscores the value of adequate sample size in developing clinically precise models.
4. Discussion
4.1. Principal Findings
In this systematic review and meta-analysis, OCT-based AI/ML models demonstrated a pooled sensitivity of 0.79 and specificity of 0.83 for predicting treatment response in neovascular AMD, although with considerable heterogeneity across studies. Clinically, this suggests that current models have moderate-to-high discriminative ability to differentiate likely good responders from poor responders. Such performance might be useful for decision support (e.g., identifying high-risk eyes that may need closer monitoring or prioritization for further assessment), but it is not yet sufficient for standalone decision-making. Notably, these discrimination metrics do not directly establish clinical utility: without prespecified clinical pathways or decision-analytic evaluation (e.g., net benefit/decision curves), we cannot quantify whether using predictions would improve outcomes or reduce unnecessary treatment changes. The substantial variability in accuracy between studies indicates that real-world performance will depend heavily on local factors such as patient population and how “response” is defined. Importantly, even the best-performing models entail a trade-off: a sensitivity around 0.8 means that about 20% of suboptimal responders would be missed, and a specificity around 0.8 means that about 20% of eyes that would in fact respond well could be misclassified as likely non-responders (potentially leading to unnecessary treatment changes if acted upon). These are promising results but also a reminder that the models are far from perfect.
From an ML perspective, one key finding is that no single type of modeling approach categorically outperformed others. Radiomics-based feature models achieved similar accuracy to deep learning models in our analysis. This implies that the features predictive of anti-VEGF response can be captured by different means—whether via handcrafted imaging features or automatically learned deep features—and that success may hinge more on data quality and study design than on algorithm novelty. Furthermore, the higher specificity observed in long-term functional outcome predictions suggests that predictive signals in baseline OCT may translate better to broad, downstream outcomes (like final visual acuity) than to short-term anatomical fluctuations that can be noisy and protocol dependent.
4.2. Machine Learning and Knowledge Extraction
A useful lens for interpreting these results is to consider what “knowledge” ML models extract from OCT images. OCT features known to correlate with outcomes—particularly fluid compartments, lesion morphology, and outer retinal integrity—have been widely studied as prognostic biomarkers in nAMD [7,8,9]. Radiomics formalizes this knowledge by converting OCT into quantitative descriptors (texture, shape, and intensity), aligning with the paradigm that images are data [10,15,16]. Deep learning, by contrast, learns internal representations directly from pixel data and has performed strongly in ophthalmic imaging tasks [11,12]; it can also automate detection of relevant features (e.g., fluid) [13], potentially standardizing inputs. In our meta-analysis, radiomics and deep learning approaches showed no significant difference in average performance, suggesting that the underlying signal linking baseline OCT patterns to treatment outcomes can be captured by both paradigms.
Importantly, apparent performance is often driven as much by validation design as by model type. Several included studies relied on internal random splits or non-nested cross-validation, which can yield optimistic estimates when feature selection and hyperparameters are effectively tuned on the same data used for evaluation. In addition, data-leakage risks—such as image-level splitting when multiple scans from the same eye or patient exist, or unclear handling of bilateral eyes—can inflate discrimination by allowing correlated information to appear in both training and test sets. By contrast, external or temporal validation (training on earlier data and testing on later cohorts or different centers/devices) was uncommon, limiting confidence in transportability across real-world settings. These ML-specific design choices plausibly contribute both to the heterogeneity we observed and to the gap between proof-of-concept performance and deployment-ready reliability. Therefore, rather than algorithm novelty alone, factors such as outcome definition, dataset representativeness, leakage control, and rigor of validation are likely the key determinants of downstream utility [22,30].
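As a concrete illustration of the leakage point above, the following sketch uses patient-grouped cross-validation in scikit-learn so that all scans and fellow eyes from the same patient stay within a single fold; the features, labels, and classifier are synthetic placeholders rather than any included study’s pipeline:

# Minimal sketch: patient-level (grouped) cross-validation to avoid the
# leakage that arises when scans or fellow eyes from the same patient are
# split across training and test sets. Data below are synthetic placeholders.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n_scans = 200
X = rng.normal(size=(n_scans, 16))                  # placeholder OCT-derived features
y = rng.integers(0, 2, size=n_scans)                # placeholder response labels
patient_id = rng.integers(0, 80, size=n_scans)      # several scans/eyes per patient

cv = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups=patient_id)):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    sens = recall_score(y[test_idx], pred)                   # sensitivity
    spec = recall_score(y[test_idx], pred, pos_label=0)      # specificity
    print(f"fold {fold}: sensitivity={sens:.2f}, specificity={spec:.2f}")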
4.3. Why Sensitivity/Specificity Matters for Clinical Use
Many AI studies in ophthalmology report AUROC as a summary of performance, but moving toward clinical applicability requires fixing decision thresholds and evaluating sensitivity and specificity (or related metrics such as predictive values). In the context of nAMD treatment, a model with high sensitivity would be valuable for catching nearly all poor responders early (avoiding undertreatment), whereas a model with high specificity would help avoid falsely labeling good responders as failures (preventing overtreatment or unnecessary switches). The pooled sensitivity (~0.79) and specificity (~0.83) we found indicate a reasonably good balance suitable for a triage or supportive role. For example, a clinic might use a sensitive model to flag patients at high risk of poor response so they can be monitored more closely or considered for adjunctive therapies; concurrently, a specificity of ~0.83 implies that roughly 17% of eyes that would in fact respond well would nonetheless be flagged as “at risk”, a false-positive rate that may be manageable in a monitoring context (the proportion of flagged eyes that ultimately respond well also depends on the prevalence of limited response).
Our decision to meta-analyze threshold-based metrics was driven by this need to interpret model performance in a decision-making frame. By contrast, if we only looked at AUROCs (which several included studies reported in the 0.70–0.85 range), we might conclude that the models are “pretty good”, but that would obscure how they might actually be used. Framing the results in terms of sensitivity/specificity makes the findings more actionable for clinicians who might integrate AI predictions into their treatment decisions. However, threshold-based metrics are inherently operating-point dependent, and the included studies often used different or incompletely reported thresholding strategies (e.g., implicit default cutoffs, data-driven optimization, or unspecified operating points). The pooled sensitivity and specificity should therefore be interpreted as a summary of performance at the reported operating points across studies rather than performance at a harmonized clinical threshold. Future work should prespecify clinically meaningful operating points and report threshold selection and calibration more transparently.
Moreover, even well-calibrated sensitivity/specificity does not, by itself, demonstrate benefit in practice. Decision-analytic evaluation (e.g., decision curves/net benefit) is needed to translate predicted risk into actionable thresholds and explicit management pathways (e.g., intensified monitoring vs. early switch vs. durability-guided dosing) and to quantify the trade-off between missed poor responders and unnecessary escalation. Because such analyses were rarely reported in the primary studies, our findings should be interpreted as evidence of discrimination rather than proven clinical utility.
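To illustrate how a fixed operating point translates into predictive values, the short calculation below applies the pooled sensitivity and specificity to a few assumed prevalences of limited response; the prevalence values are hypothetical and not derived from the included studies:

# Illustrative only: positive/negative predictive values implied by the pooled
# sensitivity and specificity at hypothetical prevalences of limited response.
sens, spec = 0.79, 0.83

for prev in (0.2, 0.3, 0.5):                    # assumed prevalence of limited response
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    print(f"prevalence {prev:.0%}: PPV={ppv:.2f}, NPV={npv:.2f}")

At lower prevalence, the positive predictive value falls even though sensitivity and specificity are unchanged, which is why the monitoring-context framing above matters for interpretation.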
4.4. Interpreting Heterogeneity and Moderator Patterns
We observed that heterogeneity in model performance was partly structured: models predicting short-term anatomical outcomes (like fluid status after three injections) showed much lower specificity on average than those predicting 12-month outcomes. One interpretation is that short-term OCT changes are influenced by numerous factors including transient physiological variation and strictness of grading criteria, making it harder for models to distinguish noise from true non-response. In contrast, a 12-month outcome (especially a functional one like final vision) integrates the effect of the entire treatment course and may be less sensitive to any single OCT visit’s quirks. Thus, models targeting longer-term endpoints might inherently achieve cleaner separation between responders and non-responders (hence higher specificity). Another finding was that larger sample sizes were associated with higher specificity. This could be due to better training of models to recognize truly negative cases or possibly due to publication bias (smaller negative studies might not be published), though our funnel plot did not show clear asymmetry.
The lack of difference in sensitivity or specificity between Asian and Western studies, and between single vs. multicenter studies, is somewhat reassuring in terms of generalizability: it suggests that the concept of using OCT to predict treatment outcome is not confined to one healthcare system or ethnicity. However, given the limited number of studies, this should be interpreted cautiously.
These subgroup analyses are exploratory, given only six studies in the meta-analysis. We treated these findings as hypothesis generating. For example, the idea that functional outcome prediction yields higher specificity needs confirmation in future meta-analyses when more studies are available. It might be confounded by other factors (indeed, the functional outcome studies also tended to be larger and more recent). Multivariable meta-regression was not feasible with so few studies, but as the field grows, it would be valuable to disentangle whether horizon, outcome type, sample size, or region independently affect performance. For now, our results hint that the prediction of long-term, functionally relevant outcomes might be the “sweet spot” for AI in nAMD, an insight that could inform what future studies prioritize.
4.5. Risk of Bias, Generalizability, and Reporting Gaps
The fact that almost all included studies were rated at high risk of bias (using PROBAST-AI) highlights a general issue in the current literature: many studies are in a proof-of-concept phase rather than a deployment-ready phase. Common limitations were lack of external validation (models often tested only on randomly split internal data), potential overfitting (complex models on small datasets), and incomplete reporting (e.g., not stating how missing OCT scans were handled or whether readers were masked to outcomes when grading OCTs). These issues can inflate reported performance. For instance, a model might appear highly sensitive in a single-center test but fail to generalize to another center’s OCT machine or patient demographics. This phenomenon has been noted across AI in healthcare, where initial excitement is tempered by later real-world evaluation. Indeed, our meta-analysis found substantial heterogeneity, which could partly stem from some models overfitting to their narrow training context.
To advance, the community should adopt established reporting standards for prediction model studies. Using checklists such as TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) [31] and STARD (Standards for Reporting of Diagnostic Accuracy Studies) [32] can improve clarity on how studies were conducted, making it easier to judge bias. Additionally, AI-specific extensions such as CONSORT-AI for trials [33] and SPIRIT-AI for protocols [34] will become relevant once these predictive models move into prospective trials. In our review, only one study was a trial-based analysis [15], but we anticipate more prospective studies and implementations in the near future.
Another aspect is model interpretability and clinical integration. None of the studies focused on explainable AI techniques, yet for an ophthalmologist to trust and use a prediction, it helps to know why a model flags a patient as high risk (e.g., “because there is subretinal fibrous tissue on the OCT that often correlates with poor response”). Incorporating explainability, or at least performing feature importance analyses (as done in one study [19]), can mitigate the “black box” concern. Recent MAKE research has explored methods to improve the interpretability of deep models in medical imaging [35], which could be applied here to increase clinician trust in AI predictions.
Finally, many studies had narrow definitions of response (like anatomy after three injections), which might not align with what patients and clinicians ultimately care about (vision and long-term disease control). We need consensus on outcome definitions for “AI-ready” endpoints in nAMD. Efforts by groups like the AMD Outcomes Initiative or the Macular Society could be valuable in standardizing what constitutes a responder, a partial responder, etc., in the context of routine care data. This would make it easier to pool and compare studies and to train models on larger combined datasets.
4.6. Implications and Next Steps
Our findings have several implications for both clinical practice and ML research in ophthalmology. First, the promise shown by these models suggests that in the near future, ophthalmologists could have access to an OCT-based tool that, at baseline, flags patients at high risk of needing more intensive therapy. This might influence how aggressively to treat (e.g., consider early switch from one anti-VEGF agent to another or the addition of steroids, etc., in a patient predicted to respond poorly to standard therapy). However, given the heterogeneity and bias risks, no model is ready for prime time yet. Before deployment, we need prospective validation in multicenter, multi-device studies. Ideally, an adaptive trial could test whether having AI predictions improves patient outcomes or resource allocation.
For the ML community, our review highlights the need to focus on generalizability and robustness. This might involve training models on more diverse datasets (e.g., using federated learning across hospitals), performing extensive external validation, and incorporating techniques to handle domain shifts (like when a new OCT device is used). It also underscores that bigger data is not the only answer: careful study design and avoiding information leakage are equally critical. Many of the high-bias studies could have been improved by methods like nested cross-validation for model selection or using an independent test set from a later time.
Another future direction is to combine imaging with other data (genetic markers and systemic factors) to see if predictions improve. Some known predictors of anti-VEGF response include baseline visual acuity and certain risk alleles; integrating these with OCT might yield more powerful models. Also, as new therapies (e.g., longer-acting biologics or gene therapy) emerge, the definition of “response” may evolve (for instance, durability might become an even more crucial outcome). AI models will need to adapt to these changes and be revalidated.
Lastly, the relatively good performance of both radiomics and deep learning suggests a possible synergy: radiomics features could be used to inform or cross-check deep models (“hybrid” models). This might improve interpretability as well, since radiomics features are easier to relate to clinical concepts (e.g., “texture feature X corresponds to subretinal fibrosis”). Approaches that generate human-understandable features as intermediate outputs (like quantifying all relevant lesion features via segmentation and then predicting outcome) could strike a balance between deep learning’s power and radiomics’ transparency.
In summary, our meta-analysis provides a benchmark of current AI capabilities in predicting nAMD treatment outcomes and offers insights to guide future research. We emphasize that achieving clinical impact will require not just incremental model accuracy gains but also addressing bias, improving interpretability, and rigorously demonstrating generalizability. From a machine learning standpoint, the comparable performance of deep and feature-based models indicates that algorithm choice may be less critical than ensuring robust study design, adequate sample size, and standardized endpoints. Future AI development in ophthalmology should therefore prioritize generalizability, interpretability, and bias mitigation to effectively translate these predictive tools into practice.