1. Introduction
Optical Coherence Tomography (OCT), first developed in the early 1990s, revolutionized non-invasive imaging by enabling cross-sectional visualization of microanatomical structures through the analysis of light reflections and temporal delays [1]. This technology advanced significantly in the 2000s with the integration of multilayer scanning, which facilitated three-dimensional reconstructions while reducing imaging time, thereby broadening its clinical utility [2]. Initially applied in cardiology for stent placement guidance and in gastroenterology for gastrointestinal disease detection, OCT has since become indispensable in ophthalmology, where its ability to delineate retinal layers supports early diagnosis of pathologies such as age-related macular degeneration (AMD), choroidal neovascularization (CNV), and diabetic macular edema (DME) [3,4,5,6].
Despite its transformative impact, retinal OCT analysis remains constrained by reliance on manual interpretation, a challenge exacerbated by variability in image quality across patient populations [7,8]. In response, machine learning (ML) has emerged as a promising tool for automating diagnostics. Recent studies demonstrate the efficacy of deep learning models in classifying retinal OCT scans, with reported sensitivity and specificity exceeding 90% in distinguishing AMD from healthy retinas [9]. Furthermore, ML algorithms have enabled the tracking of disease progression, uncovering biomarkers that inform therapeutic strategies at distinct pathological stages [10,11]. Nevertheless, the generalization of these retinal OCT models across diverse clinical datasets remains problematic due to inconsistencies in imaging protocols and environmental factors [12].
The integration of ML into OCT analysis has evolved through distinct phases over the past fifteen years. Early studies (2010–2015) established foundational approaches using traditional ML algorithms, such as support vector machines and Random Forests, to classify retinal pathologies [13,14]. These pioneering works demonstrated promising accuracy in distinguishing conditions like AMD from healthy retinas but were limited by small, homogeneous datasets [15,16]. Early approaches for DME classification achieved notable success using traditional ML techniques [17], while recent work has demonstrated improved performance through novel deep learning architectures [18]. The subsequent period (2016–2020) saw significant methodological advances with the introduction of deep learning architectures, achieving improved classification performance with sensitivity and specificity exceeding 90% [9,10]. However, these studies faced persistent challenges in dataset standardization and model generalizability across diverse clinical settings. Recent research (2021–2025) has focused on addressing these limitations through standardized datasets like OCT5k [19] and advanced architectures combining OCT with OCT angiography [20]. Despite these advances, critical gaps remain in ensuring robust cross-dataset performance, transparent training processes, and clinical applicability across unseen pathologies. While emerging semi-supervised architectures integrating transformers and CNNs show promise [21], challenges in wide-field OCT imaging and cross-dataset learning persist [22]. Our study builds upon this historical progression by specifically addressing these gaps through multisource dataset integration and validation of model generalization capabilities.
Recent advances in ML-enhanced OCT classification demonstrate broadening applications beyond ophthalmology, with significant progress in systemic disease diagnostics and interdisciplinary domains. Emerging standardized datasets like OCT5k, providing multigraded annotations for retinal layers and pathologies across 1672 scans, now enable robust benchmarking of segmentation algorithms while addressing data harmonization challenges [19]. Pang et al. (2024) advanced automated AMD detection through a novel classification approach combining deep feature extraction with traditional ML classifiers, achieving robust performance across diverse OCT presentations [18]. Deng et al. introduced a fine-grained model combining OCT and OCT angiography to stratify Port Wine Stains into vascular subtypes [20]. For diabetic macular edema, Li et al. advanced hyperreflective foci detection using KiU-Net-derived segmentation, addressing microlesion quantification in small-sample environments [23]. Teegavarapu et al. validated convolutional neural networks for OCT-based classification [24], while Kapetanaki et al. established multimodal OCT predictions for myopic maculopathy progression [25]. Semi-supervised architectures combining transformers and CNNs improved wide-field OCT segmentation accuracy over conventional methods [21]. Xu et al. demonstrated hierarchical networks for joint retinal layer/fluid segmentation [22]. Post-surgical detection from OCT biomarkers achieved high accuracy via ResNet18 models [26]. Kim et al. proposed retinal biomarker screening models for psychiatric disorders [27], while Birner et al. correlated deep learning-derived ellipsoid zone parameters with functional outcomes in geographic atrophy [28]. Kar et al. identified OCT radiomic features predictive of anti-VEGF responses [29]. Expanding systemic prognostic applications, ML models leveraging OCT-derived retinal microstructure features have demonstrated predictive utility for survival outcomes in glioblastoma patients, revealing correlations between neural thinning and disease progression [30], with glaucoma progression detection models achieving error thresholds necessary for clinical viability in monitoring functional decline [31]. Explainable frameworks linked OCT-derived retinal nerve fiber layer metrics to visual field indices [32], while Gim et al. highlighted OCT dataset standardization gaps [33]. Wang et al. validated OCT vascular analytics for cardiovascular risk stratification [34], and cross-disciplinary innovations now include agricultural applications, where OCT-based deep learning models enable non-invasive chick gender classification through automated analysis of cloacal microstructures [35]. Generative adversarial networks demonstrated OCT super-resolution capabilities [36], with potential extensions enabling virtual histopathology translation for endometrial receptivity analysis in reproductive medicine, building on recent advances in OCT imaging of reproductive processes [37]. Ortiz et al. optimized multiple sclerosis diagnosis via SHAP-driven OCT feature selection [38]. Collectively, these advancements underscore OCT’s evolution into a multidimensional diagnostic and analytical tool, bridging medical and non-medical domains through optimized ML frameworks.
This study introduces a methodological approach with three principal aims. The primary aim is developing ML models trained on integrated multisource datasets, employing stratified data partitioning to preserve pathological feature distributions while evaluating generalization on anatomically distinct conditions excluded from training. The second aim is systematically comparing multiple classification algorithms through clinical utility metrics and examining resampling techniques to mitigate class imbalance effects. The third aim is validating clinical translatability via domain-specific performance thresholds. By combining cross-dataset training with controlled validation on unseen pathological presentations, the research establishes an optimized pipeline for developing diagnostic tools that reduce interpretation variability while maintaining robustness across imaging protocols and disease manifestations.
2. Materials and Methods
The presented methodology implements a clinically oriented ML pipeline for OCT image classification, comprising three critical phases: (1) biomimetic data partitioning preserving pathological features, (2) deep biomarker extraction using pretrained vision networks, and (3) automated model selection optimized for clinical decision-making.
2.1. Composite Dataset Curation
This study integrated two publicly available retinal OCT datasets to enhance pathological diversity and sample size. The first dataset [39] comprises 24,000 high-resolution JPEG images equally distributed across eight retinal conditions: age-related macular degeneration (AMD), choroidal neovascularization (CNV), central serous retinopathy (CSR), diabetic macular edema (DME), diabetic retinopathy (DR), drusen (DRUSEN), macular hole (MH), and healthy retinas (NORMAL). The second dataset [40] provides 84,495 JPEG images across four classes (CNV, DME, DRUSEN, NORMAL). To establish a unified classification framework while balancing healthy versus pathological representation, the combined dataset was restricted to the four categories common to both sources: choroidal neovascularization (CNV), diabetic macular edema (DME), drusen (DRUSEN), and healthy retinas (NORMAL). This selective curation strategy achieved approximate parity between normal scans (NORMAL class) and aggregated pathological cases (CNV/DME/DRUSEN, classified as “abnormal”), with the pathological conditions exclusive to the first dataset (AMD, CSR, DR, MH) omitted due to the lack of counterpart images in the second source. The resulting binary classification framework preserved clinical relevance through standardized diagnostic groupings while optimizing training conditions through balanced class distributions.
Hence, the datasets utilized in this study are publicly available and were not collected by the authors. According to Kermany et al. (2018) [40], the larger dataset was acquired using Spectralis OCT equipment (Heidelberg Engineering, Germany) from multiple international centers, including institutions in the United States and China, between 2013 and 2017. Patient demographics were diverse, with no exclusion criteria based on age, gender, or race. All images underwent rigorous quality control and expert labeling through a multitiered grading system involving ophthalmologists and retinal specialists. Similar patient demographic characteristics and image acquisition protocols were maintained in the other dataset [39].
Figure 1 demonstrates the methodology for training and evaluating the ML framework using the above-described retinal OCT datasets. The approach merges Dataset 1 (24,000 images spanning seven pathologies plus healthy retinas) and Dataset 2 (84,495 images with three pathologies plus healthy retinas), retaining only the shared diagnostic classes: CNV, DME, DRUSEN, and NORMAL. Images with pathologies exclusive to Dataset 1 (i.e., AMD, CSR, DR, MH) are excluded from the training data to create an independent cross-pathology test set.
The combined dataset undergoes proportional stratified partitioning into training (70%), validation (15%), and test (15%) subsets, maintaining equivalent distributions of pathological biomarkers across splits. Hence, out of a complete dataset of over 80,000 images, approximately 70% (around 56,000 images) are allocated to the training set, while the remaining 30% are evenly divided between the validation and test sets (about 12,000 images each). This preserves retinal layer deformation patterns and fluid accumulation features in all partitions while preventing data leakage. The validation set guides hyperparameter tuning, while the standard test set evaluates performance on shared pathologies under identical imaging physics domains. The excluded AMD, CSR, DR, and MH cases from Dataset 1 form a dedicated cross-test cohort, enabling evaluation of the model’s ability to detect fundamental pathoanatomical disruptions in etiologically distinct conditions.
Cross-pathology testing examines whether biomarker representations learned from CNV/DME/DRUSEN generalize to pathologies with analogous structural anomalies but differing clinical etiologies. This strategy mimics real-world diagnostic scenarios where models encounter conditions absent from training repositories, assessing reliance on OCT biophysical signatures rather than class-specific artifacts. The methodology tests the hypothesis that retinal biomarkers form continuous manifolds in feature space, enabling extrapolation to unseen pathologies through shared microstructural disruption patterns. By training on combined datasets with inherent instrumental variability, models are compelled to focus on robust pathological signatures over dataset-specific artifacts, aligning computational learning with clinical diagnostic priorities.
2.2. Clinical Data Curation
Let the raw OCT dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ contain $N$ retinal scans $x_i$ with diagnostic labels $y_i$. To preserve pathological biomarker distributions across splits while maintaining clinical relevance, we implement proportional stratified partitioning:

$$\max_{c \in C} \left| \frac{|\mathcal{D}_s^{(c)}|}{|\mathcal{D}_s|} - \frac{|\mathcal{D}^{(c)}|}{|\mathcal{D}|} \right| \le \epsilon, \qquad s \in \{\text{train}, \text{val}, \text{test}\},$$

where $C$ denotes the set of pathological classes and $\epsilon$ the maximum distributional divergence. Here, the parameter $\epsilon$ ($<0.01$) is selected based on empirical observations to ensure that the relative class distribution in each data split closely mirrors that of the raw dataset. Algorithm 1 ensures equivalent representation of retinal layer abnormalities across partitions.
Algorithm 1: Stratified OCT data partitioning.
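For concreteness, the partitioning step can be sketched in Python as follows. This is a minimal illustration assuming a class-per-folder directory layout and the fixed seed noted below; the function and variable names are ours, not from the original implementation.

```python
import random
from pathlib import Path

def stratified_partition(root: str, ratios=(0.70, 0.15, 0.15), seed: int = 123):
    """Split a class-per-folder OCT dataset into train/val/test partitions,
    preserving each class's share of the data in every split."""
    random.seed(seed)  # controlled randomization for reproducibility
    splits = {"train": [], "val": [], "test": []}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        images = sorted(class_dir.glob("*.jpeg"))
        random.shuffle(images)  # shuffle within each class only
        n_train = int(ratios[0] * len(images))
        n_val = int(ratios[1] * len(images))
        splits["train"] += [(p, class_dir.name) for p in images[:n_train]]
        splits["val"] += [(p, class_dir.name) for p in images[n_train:n_train + n_val]]
        splits["test"] += [(p, class_dir.name) for p in images[n_train + n_val:]]
    return splits
```

Because the shuffle-and-slice is applied per class, each partition inherits the raw dataset’s class proportions up to rounding, satisfying the divergence constraint above.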
The implementation utilized a fixed random seed for reproducibility and operated on a nested directory structure in which each primary class folder contained OCT images categorized by pathological status. Stratified, rather than purely random, splitting was critical to ensure balanced representation of subtle pathological features across the training, validation, and testing partitions. Our dataset partitioning algorithm employs a controlled randomization approach to guarantee proportional representation while preventing data leakage between partitions that could artificially inflate performance metrics.
The raw dataset distributions were maintained through the following constraint:

$$\left| \frac{|\mathcal{D}_{\text{train}}^{(c)}|}{|\mathcal{D}_{\text{train}}|} - \frac{|\mathcal{D}_{\text{val}}^{(c)}|}{|\mathcal{D}_{\text{val}}|} \right| \le 0.03, \qquad \forall c \in C.$$

The threshold of 0.03 was determined to maintain a consistent balance between training and validation sets, as supported by preliminary analyses.
Each image in the partitioned dataset was processed through integrity verification to ensure metadata preservation and anatomical landmark retention. This verification process involved checking spatial resolution, bit depth, and pathological marker presence in boundary regions:

$$\text{verify}(x_i) = \prod_{r \in R} \mathbb{1}\!\left[\sigma_r^2 \ge \tau\right],$$

where $R$ represents regions of interest containing expected pathological markers, $\sigma_r$ is the standard deviation in region $r$, and $\tau$ is the variance threshold for clinically significant feature detection.
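A compact sketch of this verification check, assuming rectangular regions of interest supplied as (row0, row1, col0, col1) tuples and an illustrative variance threshold tau:

```python
import numpy as np

def verify_image(image: np.ndarray, rois, tau: float = 1e-3) -> bool:
    """Accept an OCT scan only if every region of interest exhibits enough
    intensity variance (sigma_r^2 >= tau) to contain real retinal structure."""
    x = image.astype("float32") / 255.0  # normalize 8-bit intensities
    return all(x[r0:r1, c0:c1].var() >= tau for (r0, r1, c0, c1) in rois)
```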
The stratified data partitioning approach was selected over alternative methods (such as simple random sampling or time-series splitting) to preserve pathological feature distributions across all data subsets, which is critical for retinal OCT classification where biomarker representation must be consistent between training and deployment environments. This biomimetic approach maintains equivalent distributions of subtle pathological markers (e.g., fluid accumulations, drusen deposits, and retinal layer disruptions) across splits, preventing the scenario where certain manifestations appear exclusively in either training or validation cohorts. VGG16 was chosen as our feature extractor following comparative analysis against alternatives due to its superior performance in preserving fine-grained retinal layer structures while maintaining computational efficiency. Our evaluations showed VGG16’s earlier convolutional layers captured clinically relevant texture patterns in OCT scans more effectively than newer architectures optimized for natural image classification.
2.3. Deep Biomarker Extraction
The feature extraction pipeline followed a two-stage approach combining anatomical structure preservation with deep representation learning. First, each OCT image underwent preprocessing to standardize intensity distributions and enhance pathological markers:

$$\tilde{x}_i = \frac{x_i \odot M(x_i) - \mu}{\sigma},$$

where $M(x_i)$ denotes a retinal layer segmentation mask preserving diagnostically relevant regions. Note that the constants $\mu = [0.485, 0.456, 0.406]$ and $\sigma = [0.229, 0.224, 0.225]$ are standard normalization values derived from the ImageNet dataset, ensuring compatibility with the pretrained VGG16 network. The VGG16 architecture $\Phi$ pretrained on ImageNet served as our feature extractor. Global Average Pooling distilled spatial biomarker information:

$$f_i = \frac{1}{|\Omega|} \int_{\Omega} \Phi(\tilde{x}_i)(u)\, du,$$

with $\Omega$ representing pathological fluid regions in choroidal neovascularization. Although an integral is used in Equation (5) to denote the operation over the region $\Omega$, in the code itself this is calculated as a discrete summation (i.e., an average pooling) over the corresponding pixel activations. In this way, the notation serves as an abstraction for the pooling operation performed on the VGG16 feature maps.
The feature extraction process was implemented as follows:

$$f_i = \mathrm{GAP}\big(\Phi(\tilde{x}_i)\big) \in \mathbb{R}^{512}, \qquad i = 1, \ldots, N,$$

where each $f_i$ represents the embedded representation of OCT image $x_i$ in the biomarker feature space. The feature extraction utilized transfer learning through a pretrained VGG16 model with weights initialized from ImageNet training. The avgpool layer output produced a compact 512-dimensional representation capturing hierarchical features across different retinal anatomical structures.
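A minimal TensorFlow/Keras sketch of this extraction stage (batching and file I/O are omitted; the per-channel normalization uses the ImageNet constants quoted above):

```python
import numpy as np
from tensorflow.keras.applications import VGG16

# ImageNet normalization constants quoted above
MEAN = np.array([0.485, 0.456, 0.406], dtype="float32")
STD = np.array([0.229, 0.224, 0.225], dtype="float32")

# pooling="avg" applies Global Average Pooling to the final 512-channel
# feature maps, yielding one 512-dimensional biomarker vector per image.
extractor = VGG16(weights="imagenet", include_top=False,
                  pooling="avg", input_shape=(224, 224, 3))

def extract_features(batch: np.ndarray) -> np.ndarray:
    """Map RGB OCT images of shape (N, 224, 224, 3) to (N, 512) embeddings."""
    x = (batch.astype("float32") / 255.0 - MEAN) / STD
    return extractor.predict(x, verbose=0)
```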
The embedded representation prioritized features from convolutional layers sensitive to the following:
Retinal layer boundary disruptions indicative of pathology.
Subretinal fluid accumulations in exudative conditions.
Drusen deposits characteristic of early AMD.
Retinal thinning and structural deformation patterns.
Each extracted feature vector underwent statistical normalization to ensure numerical stability during downstream modeling:

$$\hat{f}_i = \frac{f_i - \mu_f}{\sigma_f},$$

where $\mu_f$ and $\sigma_f$ denote the per-dimension mean and standard deviation computed over the training set.
VGG16 was selected as our feature extractor following evaluation against alternative CNN architectures. VGG16’s pretrained convolutional layers offer specific advantages for OCT image analysis: (1) its hierarchical feature organization effectively captures the stratified nature of retinal layers, with earlier layers detecting boundaries and later layers identifying pathological patterns; (2) its uniform 3 × 3 convolutional filters demonstrate superior sensitivity to fine-grained texture variations in fluid accumulations and retinal layer disruptions; (3) its feature generalization capabilities enable robust performance across heterogeneous OCT datasets despite equipment variations; and (4) its transfer learning efficiency mitigates the limited availability of pathology-specific training data. While VGG16 has known limitations including higher computational demands and constrained global spatial context compared to attention-based architectures, our experiments confirmed it consistently preserved clinically relevant OCT biomarkers more effectively than newer architectures optimized primarily for natural image classification. The global average pooling implementation further mitigates spatial limitations by aggregating feature responses across pathologically relevant regions.
2.4. Clinical Model Optimization
PyCaret’s automated framework compared 15 classifiers using stratified 5-fold cross-validation weighted by clinical significance:

$$S(m) = \frac{\mathrm{AUC}(m) + w \cdot \mathrm{Sens}(m) + \mathrm{Spec}(m)}{2 + w},$$

where $w$ weights sensitivity higher for early disease detection ($w = 2$). SMOTE addressed class imbalance through lesion-simulated oversampling [41]:

$$x_{\text{synth}} = x_i + \lambda\,(x_{\text{nn}} - x_i), \qquad \lambda \sim \mathcal{U}(0, 1),$$

where $x_{\text{nn}}$ is a randomly chosen minority-class nearest neighbor of $x_i$.
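A condensed sketch of this stage using PyCaret 3.0’s classification API; the feature file name and column layout are illustrative assumptions, while fix_imbalance=True enables SMOTE oversampling within the training folds:

```python
import pandas as pd
from pycaret.classification import setup, compare_models, predict_model

# 512 VGG16 embedding columns plus a binary 'label' column (illustrative file)
features = pd.read_csv("oct_vgg16_features.csv")

setup(data=features, target="label",
      train_size=0.7,          # stratified train/holdout split
      fold=5,                  # stratified 5-fold cross-validation
      fix_imbalance=True,      # SMOTE oversampling of the minority class
      session_id=123)          # fixed seed, matching the reported pipeline

best = compare_models(sort="AUC")  # rank candidate classifiers by AUC
predict_model(best)                # confirm performance on the holdout set
```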
The model selection procedure evaluated a comprehensive set of algorithms spanning various complexity classes:

$$\mathcal{A} = \{\text{LR}, \text{Ridge}, \text{LDA}, \text{KNN}, \text{RF}, \text{LightGBM}, \ldots\}.$$

Each algorithm $a \in \mathcal{A}$ underwent hyperparameter optimization via stratified 5-fold cross-validation with a nested grid search exploring parameter space $\Theta_a$:

$$\theta_a^* = \arg\max_{\theta \in \Theta_a} \frac{1}{5} \sum_{k=1}^{5} S\big(a_\theta;\, \mathcal{D}_{\text{val}}^{(k)}\big).$$
The model scoring function prioritized clinical decision support requirements through a weighted combination of metrics:

$$\text{Score}(m) = \alpha \cdot \mathrm{AUC}(m) + \beta \cdot \mathrm{Sens}(m) + \gamma \cdot \mathrm{Spec}(m),$$

where $\alpha$, $\beta$, and $\gamma$ reflect the clinical priority of minimizing false negatives while maintaining overall discrimination performance.
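In code, Equation (12) reduces to a plain weighted sum; the weights below are left as parameters rather than the study’s calibrated values:

```python
def composite_score(auc: float, sensitivity: float, specificity: float,
                    alpha: float, beta: float, gamma: float) -> float:
    """Equation (12): weighted blend of discrimination and clinical metrics.
    Choosing beta > gamma encodes the priority on minimizing false negatives."""
    return alpha * auc + beta * sensitivity + gamma * specificity
```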
The trade-off between sensitivity and specificity was further optimized through an iterative threshold calibration procedure. After model selection, decision thresholds were systematically adjusted away from the default 0.5 probability cutoff to maximize sensitivity (minimizing false negatives) while maintaining specificity within clinically acceptable parameters. This approach utilized a constraint-based optimization where sensitivity was maximized subject to maintaining specificity above 0.92, resulting in optimized threshold values across validated models. Threshold calibration was performed exclusively on the validation set to prevent optimization bias, with final performance confirmed on the independent test set. This methodology aligns with ophthalmological clinical priorities where early pathology detection outweighs the operational costs of occasional false positives, particularly for sight-threatening conditions requiring prompt intervention.
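A minimal sketch of this constraint-based calibration, assuming validation-set labels y_val and predicted probabilities p_val from the selected model:

```python
import numpy as np
from sklearn.metrics import roc_curve

def calibrate_threshold(y_val, p_val, min_specificity: float = 0.92) -> float:
    """Maximize sensitivity subject to specificity >= min_specificity,
    sweeping the candidate thresholds produced by the ROC analysis."""
    fpr, tpr, thresholds = roc_curve(y_val, p_val)
    feasible = (1.0 - fpr) >= min_specificity  # specificity constraint
    if not feasible.any():
        return 0.5  # fall back to the default probability cutoff
    best = np.argmax(tpr[feasible])  # highest sensitivity among feasible cutoffs
    return float(thresholds[feasible][best])
```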
2.5. Clinical Implementation
The pipeline combines anatomically aware processing with computational efficiency:
Data layer: DICOM-to-PNG conversion preserving 16-bit depth and OCT metadata.
Preprocessing: Retinal layer alignment using SLO-guided registration.
Hardware: NVIDIA A100 GPUs with FP16 acceleration.
Reproducibility: Seed locking (PyCaret session_id = 123).
Deployment: ONNX runtime integration for clinical systems.
Learning curves monitored diagnostic stability through generalization gaps:

$$\Delta_{\text{gen}}(t) = \mathrm{Acc}_{\text{train}}(t) - \mathrm{Acc}_{\text{val}}(t).$$
The computational workflow was implemented as a sequential pipeline:

$$x \;\xrightarrow{\;T\;}\; \tilde{x} \;\xrightarrow{\;\Phi,\ \mathrm{GAP}\;}\; f \;\xrightarrow{\;m^*\;}\; \hat{y}.$$
Image integrity was preserved through calibrated preprocessing transformations:

$$T = T_{\text{aug}} \circ T_{\text{resize}} \circ T_{\text{norm}},$$

where $T_{\text{norm}}$ normalizes pixel intensities, $T_{\text{resize}}$ ensures dimensional compatibility with the feature extractor, and $T_{\text{aug}}$ applies clinically plausible variations to enhance model generalization.
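A short TensorFlow sketch of this composition; the specific augmentations shown (mild brightness jitter, horizontal flips) are illustrative stand-ins for the clinically plausible variations described above:

```python
import tensorflow as tf

def preprocess(image: tf.Tensor, training: bool = False) -> tf.Tensor:
    """Apply T_norm, then T_resize, then (during training only) T_aug."""
    x = tf.image.convert_image_dtype(image, tf.float32)  # T_norm: scale to [0, 1]
    x = tf.image.resize(x, (224, 224))                   # T_resize: VGG16 input size
    if training:                                         # T_aug: plausible variation
        x = tf.image.random_brightness(x, max_delta=0.05)
        x = tf.image.random_flip_left_right(x)
    return x
```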
The training phase incorporated early stopping with patience $p$ epochs to prevent overfitting while maximizing validation AUC:

$$t^* = \min\left\{ t : \mathrm{AUC}_{\text{val}}(t) \ge \mathrm{AUC}_{\text{val}}(t') \ \ \forall\, t' \in (t, t + p] \right\}.$$
Performance evaluation extended beyond standard metrics to include clinically relevant indicators, chief among them the clinical utility score:

$$U = \alpha \cdot \mathrm{PPV} + \beta \cdot \mathrm{NPV} + \gamma \cdot F_1,$$

where the weights $\alpha$, $\beta$, and $\gamma$ encode clinical decision priorities (see Section 3.5).
The complete pipeline including data preprocessing, feature engineering, and model selection was implemented in Python 3.9 using TensorFlow 2.8 for feature extraction and PyCaret 3.0 for automated ML workflows. Computational reproducibility was ensured through version-controlled dependencies and fixed random seeds across all stochastic processes.
3. Results
The analysis addresses intersections between computational methodology and clinical diagnostics through a structured evaluation framework as follows. A comparative assessment of machine learning architectures establishes foundational performance characteristics across model categories. Data optimization techniques are systematically evaluated to quantify classification improvements while balancing diagnostic priorities. Model convergence and stability are assessed through iterative training dynamics, complemented by diagnostic reliability metrics that emphasize clinical trustworthiness. Finally, clinical applicability is measured against domain-specific imperatives to align computational outcomes with patient-centered decision protocols. These components collectively evaluate technical robustness while framing practical implications for automated retinal pathology detection.
3.1. Comparative Model Performance
The systematic evaluation of 14 machine learning models (Table 1) demonstrated significant variability in classification performance for retinal OCT image analysis. Logistic Regression (LR) achieved the highest accuracy (0.9396), AUC (0.9832), and composite score (0.963), calculated through Equation (12)’s weighted combination of discrimination power and clinical metrics, outperforming computationally intensive architectures such as LightGBM (accuracy: 0.9394, score: 0.960) and Random Forest (accuracy: 0.9317, score: 0.951). Notably, simpler linear models (LR, Ridge Classifier, LDA) dominated the top rankings, collectively achieving mean accuracy and AUC values of 0.9344 and 0.9821, respectively, compared to 0.9201 and 0.9767 for tree-based ensembles. This suggests that discriminative pathological features in OCT scans may be linearly separable in the VGG16-derived biomarker space.
While LR achieved the highest aggregate accuracy and AUC (Table 1), its performance remained highly competitive in metrics where other models marginally surpassed it. Notably, LightGBM exceeded LR in precision (0.8886 vs. 0.8799), while KNN demonstrated superior recall (0.9849 vs. 0.9464), suggesting minor trade-offs between specificity and sensitivity across model architectures. However, LR maintained the second-highest precision (0.8871) and third-highest recall (0.9460), demonstrating robust equilibrium in clinical metrics where specialized models achieved isolated advantages. This balanced performance profile positions LR as the optimal compromise between detection reliability and confirmation specificity for retinal OCT analysis. The narrow margins in secondary metrics, precision differing by 0.0087 and recall by 0.0385 between LR and the respective leaders, further support the model’s clinical viability, as practical screening systems prioritize consistent performance across all diagnostic parameters over isolated metric maximization.
When examining model performance across pathological subtypes, we observe distinct patterns indicating why LR outperforms more complex architectures. For neovascular conditions like CNV, characterized by fluid accumulation and membrane formation, LR achieves 96.2% sensitivity compared to Random Forest’s 93.7%, likely because linear decision boundaries effectively separate the high-dimensional VGG16 features representing fluid distribution patterns. Conversely, for DRUSEN cases, which present as hyperreflective deposits with more uniform distribution patterns, tree-based models like LightGBM (94.3% accuracy) perform similarly to LR (94.6%). The optimal performance of LR can be attributed to three complementary factors: (1) the VGG16 feature extractor already captures non-linear relationships in retinal layer morphology, providing LR with transformed features that become linearly separable; (2) fluid accumulation patterns in DME and CNV manifest as consistent intensity gradients in OCT images, creating feature vectors with strong linear correlations to pathological status; and (3) LR’s inherent regularization prevents overfitting to training-specific artifacts that might occur with more complex models. This suggests that while deep learning extracts complex hierarchical features from raw OCT data, the final classification boundary between normal and pathological states remains relatively linear in this transformed feature space, allowing LR to establish optimal separating hyperplanes that generalize effectively across heterogeneous OCT presentations.
3.2. Confusion Matrices
A confusion matrix is a diagnostic tool that evaluates classification models in clinical settings by cross-tabulating predicted outcomes against actual results across four categories: true positives (correctly identified pathologies), true negatives (accurately detected normal cases), false positives (normal scans misclassified as abnormal), and false negatives (pathologies missed by the model). In clinical applications like retinal OCT analysis, confusion matrices objectively quantify a model’s capacity to distinguish between diseases such as age-related macular degeneration and healthy scans, providing clinicians with a clear visual summary of correct diagnoses versus critical classification errors.
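The four cells map directly onto the sensitivity and specificity figures discussed below, as a small scikit-learn example with toy labels illustrates:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0]  # toy ground truth: 1 = abnormal, 0 = normal
y_pred = [1, 0, 0, 1, 1, 0]  # toy model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # abnormal cases correctly detected
specificity = tn / (tn + fp)  # normal cases correctly cleared
```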
The confusion matrices presented in Figure 2 reveal important insights about how SMOTE optimization affects ML algorithm performance when dealing with imbalanced data. Without SMOTE, the algorithm shows a concerning 7.34% rate of “false-abnormal” classifications, i.e., normal cases incorrectly classified as abnormal (false positives). At the same time, it produces “false-normal” classifications (i.e., abnormal cases incorrectly identified as normal, or false negatives) at a rate of 5.02%. The ability to correctly identify abnormal cases sits at 94.98% (true-positive rate), while correct normal identifications occur 92.66% of the time (true-negative rate). When SMOTE is applied, a significant shift is observed in these patterns. The “false-abnormal” rate decreases substantially to 5.39%, representing improved specificity and fewer normal cases being incorrectly flagged. This comes with a trade-off, however, as the “false-normal” rate increases slightly to 5.94%. Correct identification rates adjust accordingly, with abnormal case detection (sensitivity) declining marginally to 94.06%, while normal case accuracy improves to 94.61%.
This transformation occurs because SMOTE creates synthetic samples of the minority class, giving the algorithm more balanced training data. The result is a classification approach that shows improved performance in correctly identifying normal cases at the expense of a slight decrease in abnormal case detection. The fundamental trade-off revealed in these matrices is that SMOTE helps reduce the potentially disruptive “false-abnormal” errors (false positives) at the expense of a slight increase in “false-normal” classifications (false negatives).
For many applications where minimizing false alarms is important for operational efficiency, this recalibration represents a worthwhile improvement, though practitioners must carefully consider whether the slight decrease in abnormal detection sensitivity is acceptable for their specific use case. In scenarios where missing an abnormal case carries significant consequences, alternative approaches might be preferable.
3.3. Convergence—Learning Curves
Learning curves are graphical tools that monitor a classification model’s diagnostic performance during training by plotting accuracy (or loss) metrics against increasing training data. In clinical ML applications like retinal OCT analysis, these curves compare a model’s evolving performance on training data (its learning progress) against validation data (its generalizability to new cases), visually depicting whether the algorithm improves with experience or becomes overfit to specific training patterns. By tracking how learning stabilizes across retinal pathologies and healthy scans, these curves help ensure models achieve reliable, consistent diagnostic capabilities. Learning curves are critical for confirming that reported performance metrics reflect clinically trustworthy decision-making.
Figure 3 presents learning curves for an ML model developed for the classification of retinal OCT images into normal and abnormal categories. Four distinct curves are displayed: training and validation accuracies both with and without the application of SMOTE (Equation (9)). Both sets of curves exhibit a pattern where the training and validation accuracies converge toward similar values and then plateau together, which is a hallmark of well-trained ML models that have reached an optimal balance in their learning progression. The observed gap between training and validation curves represents the model’s generalization capability, i.e., a natural phenomenon where models typically perform better on training data than on unseen validation data. This gap stems from the fundamental challenge of generalization in ML and reflects the balance between underfitting and overfitting. Notably, the implementation of SMOTE results in a reduced gap between training and validation accuracies compared to the non-SMOTE approach. This narrower separation suggests enhanced generalization capabilities and reduced overfitting when SMOTE is applied. The improvement indicates that SMOTE effectively addresses class imbalance issues in the dataset by generating synthetic examples of minority classes, which is particularly valuable in medical imaging applications such as retinal OCT classification where pathological conditions may be underrepresented in training data. The learning curves thus demonstrate that SMOTE contributes to a more robust and reliable model for distinguishing between normal and abnormal retinal conditions.
3.4. Convergence—ROC Curves
Receiver operating characteristic (ROC) curves represent a fundamental visualization method in ML evaluation. These curves plot the true-positive rate (sensitivity) on the y-axis against the false-positive rate (1-specificity) on the x-axis across different classification thresholds. This visualization encapsulates the fundamental trade-off between sensitivity, which measures the ability to correctly identify abnormal cases such as diseased retinas, and specificity, which quantifies the ability to correctly identify normal cases such as healthy retinas.
Multiple ROC curves with corresponding area under the curve (AUC) values are shown in Figure 4. The ROC of Class 1 demonstrates an AUC of 0.98, while the ROC of Class 0 similarly achieves an AUC of 0.98. The microaverage ROC exhibits the highest performance with an AUC of 0.99, and the macroaverage ROC maintains consistency with an AUC of 0.98. These AUC values, approaching the maximum value of 1.0, indicate high discriminative ability of the developed ML models in classifying OCT images. The visual confirmation of this high performance is evident in the curves’ steep ascent near the y-axis and their proximity to the upper-left corner of the plot.
In the OCT classification context of this study, ROC curves serve multiple critical functions. First, as shown in Equation (8), AUC is utilized as part of the weighted scoring function ($S$) to select the optimal model ($m^*$) through 5-fold cross-validation. Second, sensitivity over specificity ($w = 2$ in our weighting scheme) is prioritized to minimize false negatives, which is particularly crucial in disease detection scenarios where missing abnormal cases carries higher clinical risk. Third, the high AUC values (0.98–0.99) quantify the models’ ability to distinguish between normal retinas and the specific pathologies targeted in this study. Fourth, ROC curves allow the selection of optimal decision thresholds that balance sensitivity and specificity according to clinical priorities.
3.5. Clinical Utility
The clinical utility framework (Equation (17)) quantitatively evaluates the translational value of OCT classification models by harmonizing diagnostic accuracy with clinical decision priorities. Table 2 demonstrates this integration through weighted combinations of positive predictive value (PPV), negative predictive value (NPV), and F1 scores, governed by the coefficients $\alpha = 0.1$, $\beta = 0.6$, and $\gamma = 0.3$. These weights were deliberately calibrated to address two critical clinical imperatives: (1) NPV-centric emphasis ($\beta = 0.6$): The heightened weighting of NPV reflects the clinical necessity to minimize false-negative diagnoses in medical pathology detection. For conditions like diabetic macular edema or choroidal neovascularization, missed diagnoses (false negatives) carry greater patient risk than false alarms, as delayed treatment can lead to irreversible vision loss. The $\beta = 0.6$ allocation ensures predictive models prioritize accurate exclusion of disease, reducing unnecessary follow-up burdens while safeguarding against underdiagnosis. (2) PPV and F1 balance ($\alpha = 0.1$, $\gamma = 0.3$): The smaller but non-zero weights for PPV ($\alpha = 0.1$) and F1 ($\gamma = 0.3$) maintain equilibrium between confirming true pathologies (PPV) and balancing precision/recall trade-offs (F1). This configuration acknowledges that while ruling out disease is paramount, definitive identification of true positives remains essential for initiating targeted therapies.
Table 2 illustrates the operationalization of these priorities. The unoptimized model achieves a total clinical utility of 0.937 (vs. 0.934 with optimization), driven by NPV’s dominant contribution (0.570 vs. 0.564 with optimization). Despite PPV’s lower weight, its retained inclusion ensures clinicians receive actionable confirmatory signals when pathologies are detected—a safeguard against over-reliance on NPV alone.
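As a consistency check, Equation (17) can be evaluated directly; under $\beta = 0.6$, the reported NPV contribution of 0.570 implies NPV = 0.95 for the unoptimized model:

```python
def clinical_utility(ppv: float, npv: float, f1: float,
                     alpha: float = 0.1, beta: float = 0.6,
                     gamma: float = 0.3) -> float:
    """Equation (17): NPV-dominated weighting of predictive values and F1."""
    return alpha * ppv + beta * npv + gamma * f1

npv = 0.570 / 0.6  # back out NPV from its weighted contribution -> 0.95
```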
The emphasis on NPV ($\beta = 0.6$) in our clinical utility framework reflects the critical importance of minimizing false negatives in sight-threatening retinal conditions such as choroidal neovascularization or diabetic macular edema, where delayed treatment can lead to irreversible vision loss. For pathologies where false negatives are less critical (e.g., benign or slowly progressing conditions), this weighting configuration maintains clinical value through increased reliability in ruling out disease but creates a specific performance profile. Our analysis demonstrates that while the SMOTE-optimized model achieves 94.06% sensitivity with NPV emphasis, it produces a slightly higher false-positive rate (5.39%) compared to alternative weighting schemes. Adjusting these weights for different pathological contexts offers performance customization: reducing NPV weight for benign conditions would improve specificity and reduce unnecessary follow-ups, increasing PPV weight would optimize confirmatory diagnostics rather than screening functions, and equalizing weights would provide balanced performance across error types. This flexibility enables context-specific optimization, as demonstrated by our cross-pathology testing results (95.09% sensitivity across untrained pathologies), suggesting the current weighting generalizes effectively while maintaining sufficient adaptability for diverse clinical scenarios based on specific risk–benefit considerations.
Succinctly speaking, this weighting schema aligns with evidence-based ophthalmological practice where diagnostic certainty in excluding disease (NPV) often supersedes confirmatory precision. By explicitly parameterizing these trade-offs, the framework bridges statistical performance and clinical pragmatism, ensuring models prioritize outcomes most consequential to patient care.
3.6. Cross-Pathology Generalization Testing
To test the model’s ability to generalize beyond its training domain, the trained ML framework was further evaluated on retinal pathologies explicitly excluded from the original training set (age-related macular degeneration, central serous retinopathy, diabetic retinopathy, and macular hole), assessing its capacity to detect anatomical disruptions despite lacking prior disease-specific exposure.
Testing the model on anatomically analogous but etiologically distinct pathologies is justified through complementary mechanisms. The VGG16 biomarker extractor learns cross-pathology retinal disruption patterns from the training classes (CNV/DME/DRUSEN), enabling detection of shared morphological anomalies like subretinal fluid accumulations in CSR and structural voids in MH. Consistent OCT imaging physics ensures domain invariance through standardized representation of pathophysiological signatures, allowing the model to focus on fundamental tissue properties rather than disease-specific artifacts.
Figure 5 presents a confusion matrix evaluating the SMOTE-optimized model’s performance on the OCT pathologies (AMD, CSR, DR, MH) on which it was not originally trained and validated, within a new 4421-image test set. The matrix exhibits strong diagnostic discrimination, with 2206 true-normal classifications representing 95.05% specificity and 1997 true-abnormal detections achieving 95.09% sensitivity. Classification errors remain clinically acceptable at 4.95% false-abnormal (115 instances) and 4.91% false-normal (103 cases), indicating balanced performance across both healthy and pathological cohorts. The symmetrical error distribution suggests equivalent clinical risk between over-referral of normal cases and underdetection of disease.
Table 3 evaluates the trained model’s performance and cross-pathology testing results, emphasizing clinical utility and diagnostic reliability. The clinical utility value is calculated using Equation (17), which prioritizes metrics like NPV and F1 score to align with clinical decision-making priorities. The comparison spans nine key metrics and demonstrates enhanced generalizability despite testing on etiologically distinct pathologies. The model achieves a relative accuracy gain of approximately 0.9% (0.9420 to 0.9507) alongside a 3.55% relative increase in clinical utility (0.9163 to 0.9488), underscoring its robust adaptation to unseen conditions. Specificity improves from 0.9460 to 0.9519 while recall remains above 0.95, confirming preserved sensitivity to novel pathological features. The composite score rises to 0.9656 through optimized weighting of discriminative metrics, reflecting enhanced equilibrium between diagnostic precision and clinical decision priorities. This multidimensional validation approach confirms the model’s capacity to navigate n-dimensional biomarker manifolds rather than memorizing class-specific features.
4. Discussion
4.1. Learning Curve Dynamics and Model Generalization
While strong performance metrics can sometimes create a false impression of clinical utility, verifying the dynamics of learning curves is important for establishing the trustworthiness of a model. A well-trained model will show validation and training curves that progressively align, maintaining a consistent generalization gap as it converges. In this analysis, both the SMOTE-optimized and non-SMOTE approaches demonstrated this behavior, with a smaller gap between training and validation trajectories in the SMOTE method, indicating improved generalization. Healthy learning patterns are characterized by validation accuracy that steadily approaches training accuracy, avoiding erratic divergence or complete overlap. A smaller gap is preferable to a larger one; however, a complete overlap can obscure overfitting to both datasets, while an expanding gap suggests poor generalization. Importantly, inverted curves, where validation accuracy exceeds training accuracy, may signal issues such as contamination of the validation set or biased sampling, which can compromise reliability. For OCT classification, these dynamics help confirm whether the features learned truly capture pathological signatures rather than artifacts from the dataset. The smaller gaps observed with SMOTE indicate better handling of class imbalance, while unstable patterns could render even high numerical metrics clinically questionable. Therefore, validating learning curves should be standard practice for distinguishing robust biomarkers from misleading correlations.
4.2. Hybrid Methodology: Deep Learning and Traditional ML Classifiers
This study deliberately employed a hybrid methodology combining deep learning feature extraction (VGG16 pretrained networks) with traditional ML classifiers rather than implementing end-to-end deep neural architectures. This hybrid approach aligns with successful previous work in DME classification where traditional ML classifiers demonstrated strong performance when combined with carefully engineered features [17]. However, this approach involves certain trade-offs. While potentially limiting the model’s ability to learn hierarchical OCT-specific features that might emerge from end-to-end training, it offers significant advantages for our specific binary classification task. The pretrained VGG16 feature extraction combined with traditional ML classifiers provided superior interpretability, required less training data, and enabled precise calibration of sensitivity–specificity trade-offs through transparent threshold adjustments, critical factors for clinical adoption. Logistic Regression demonstrated particularly strong performance (accuracy: 0.9396, AUC: 0.9832) compared to more complex alternatives, offering an optimal balance between computational efficiency and classification power while requiring significantly fewer resources than full deep learning implementations. Our comparative analysis confirmed that this hybrid approach achieved equivalent discrimination power with enhanced interpretability compared to more complex alternatives. However, we acknowledge that for future multiclass pathology classification tasks, where increased feature complexity and inter-class relationships are more nuanced, full deep learning implementations may become necessary to capture hierarchical relationships across diverse OCT presentations beyond the binary classification presented in this initial study.
4.3. ROC Curves and Clinical Implications
The ROC curves (Figure 4) complement the confusion matrices shown in Figure 2, which compare classification results with and without SMOTE optimization. While confusion matrices show performance at a specific threshold, ROC curves evaluate performance across all possible thresholds. The improved specificity after SMOTE optimization, evidenced by the reduced “false-abnormal” rate from 7.34% to 5.39%, is consistent with the high AUC values. This demonstrates how the model was calibrated to balance sensitivity and specificity in detecting pathological conditions.
For clinicians interpreting OCT images, these ROC curves demonstrate that the developed ML models provide highly reliable classification with minimal risk of missing pathological cases. The AUC of 0.98–0.99 suggests that if presented with one normal and one abnormal OCT image, the model would correctly rank the abnormal image with higher probability approximately 98–99% of the time. This clinical model optimization approach, detailed in Equation (12), further underscores how overall discriminative performance (i.e., AUC) is balanced with the specific clinical need to minimize false negatives in retinal disease detection.
4.4. Clinical Utility Metrics and Ethical Considerations
The integration of clinical utility metrics into ML model evaluation represents a critical shift from purely statistical validation to clinically grounded assessment. While traditional performance measures like accuracy and AUC provide essential insights into model discrimination, they often fail to capture the nuanced trade-offs inherent in medical decision-making. Our $(\alpha, \beta, \gamma)$-weighted clinical utility framework (Equation (17)) explicitly bridges this gap by aligning model outputs with domain-specific priorities, particularly the imperative to minimize false-negative diagnoses in sight-threatening conditions like AMD and DME, where delayed intervention carries irreversible consequences. By disproportionately weighting NPV ($\beta = 0.6$), the framework ensures that models prioritize ruling out disease with high confidence, mirroring the “first, do no harm” ethos of clinical practice. This approach acknowledges that a marginally lower AUC becomes clinically acceptable if accompanied by a significant reduction in missed pathologies, as evidenced by the optimized model’s utility score improvement (Table 2). Our emphasis on clinical relevance builds upon previous work in automated OCT analysis for both AMD [18] and DME [17] that demonstrated the importance of aligning computational metrics with clinical priorities.
4.5. SMOTE Optimization and Real-World Reliability
While the clinical utility scores for models with and without SMOTE optimization appear numerically comparable (0.9237 vs. 0.9163), this superficial similarity masks critical differences in their real-world reliability. Closer analysis reveals that the SMOTE-optimized model achieves its clinical advantage through superior training dynamics: learning curves demonstrate narrower generalization gaps between training and validation accuracies, a hallmark of robust models primed for deployment. Though utility values nearing 1 (indicating alignment with priorities like minimizing false negatives) are essential, stability outweighs marginal metric gains in clinical settings. The SMOTE-enhanced model’s reduced overfitting ensures consistent performance across heterogeneous patient populations, a decisive factor where diagnostic reliability trumps isolated numerical superiority. Thus, while both models satisfy baseline clinical utility thresholds, the SMOTE approach justifies slight trade-offs in metrics by delivering generalizable robustness, a prerequisite for trustworthy deployment in variable real-world environments.
4.6. Ethical Obligations in Translational ML Frameworks
Such metrics counterbalance the seductive allure of high-performance benchmarks that may mask clinically hazardous error distributions, as seen in models favoring PPV-dominated scores that risk undertreatment. In an era where ML increasingly informs diagnostic workflows, translational frameworks like ours, which tether algorithmic outputs to patient-centered outcomes, are not merely advantageous but ethically obligatory. They transform ML validation from an academic exercise into a fidelity test for real-world clinical viability, ensuring that models advance beyond computational virtuosity to become trustworthy partners in patient care.
4.7. Cross-Pathology Generalization
Cross-pathology testing demonstrated robust generalization to anatomically distinct retinal pathologies (AMD, CSR, DR, MH) excluded from training, achieving 95.09% sensitivity and 95.05% specificity (Figure 5). This suggests that the extracted VGG16-derived biomarkers captured latent structural disruption patterns (e.g., fluid accumulation, retinal layer discontinuities) transferable across etiologically diverse conditions, reducing dependency on disease-specific training data.
To establish the proposed framework’s clinical viability, a three-tiered validation approach is essential for deployment across varied healthcare settings. First, multicenter external validation using heterogeneous OCT datasets from diverse equipment manufacturers and patient demographics would verify generalizability beyond research environments. Domain adaptation techniques would address device-specific artifacts, while stratified sampling would ensure representation across disease severity spectra. Second, performance benchmarking against a demographically diverse test set including rare pathological variants would establish sensitivity boundaries for unusual presentations. Third, prospective implementation through a clinical decision support interface would enable the real-time evaluation of diagnostic concordance between model predictions and specialist interpretations. For optimal real-world performance, the framework would require periodic recalibration through targeted fine-tuning on challenging cases, integration of uncertainty quantification to flag low-confidence predictions for expert review, and implementation of explainability mechanisms to facilitate clinician trust. Our cross-pathology validation approach provides preliminary evidence for the framework’s adaptability to novel retinal conditions, suggesting promising generalization capabilities for deployment across diverse clinical environments where standardized interpretation remains challenging.
4.8. Clinical Workflow Integration
Successful clinical implementation of the proposed OCT classification framework requires thoughtful integration into diverse ophthalmological workflows. In high-volume screening settings, the model could function as an initial triage mechanism, automatically identifying scans that require urgent specialist attention while providing standardized reporting for normal findings. This approach has the potential to significantly reduce the time specialists spend reviewing cases. For referral management, integrating with existing electronic referral systems would enable automated preassessment prioritization based on pathology probability scores. Implementation challenges include variable OCT hardware specifications across clinical environments, necessitating vendor-specific calibration protocols and domain adaptation techniques to maintain consistent performance. Addressing these challenges requires a phased deployment approach: (1) a shadow-mode validation phase where AI predictions are generated but not displayed to clinicians; (2) a supervised implementation phase with AI predictions displayed alongside confidence metrics but requiring expert confirmation; and (3) a selective automation phase where high-confidence normal classifications might bypass manual review in appropriate risk-stratified scenarios. This graduated approach balances the benefits of workflow efficiency against the imperative of maintaining diagnostic safety while allowing for continuous model refinement through feedback loops from clinical outcomes.
4.9. Study Limitations and Future Directions
Despite the results, several limitations of this study warrant discussion to provide a balanced assessment of the framework’s clinical applicability. First, data limitations include the reliance on retrospective publicly available datasets, which may not fully represent the diverse spectrum of retinal pathologies encountered in clinical practice, particularly in terms of disease severity gradations and atypical presentations. The datasets also lack comprehensive demographic diversity information, limiting our ability to assess potential biases across age groups, ethnicities, and comorbidity profiles. Second, while our cross-pathology testing demonstrated encouraging generalization, the model’s performance in more complex clinical scenarios (such as cases with multiple concurrent pathologies, media opacities, or poor image quality) remains unverified. Third, the current implementation employs a binary classification approach (normal vs. abnormal), which, while clinically valuable as a screening tool, does not achieve the granular multiclass differentiation that would be required for comprehensive diagnostic assistance. Fourth, our framework utilizes transfer learning from VGG16, a somewhat dated architecture that carries increased computational overhead compared to more recent models; this represents a trade-off between established feature extraction capability and efficiency. Finally, despite our emphasis on clinical utility metrics, this study lacks prospective validation in real clinical workflows, where integration challenges and impact on clinical decision-making would become apparent. Future work should address these limitations through multicenter prospective studies with demographically diverse cohorts, expansion to multiclass classification capability, evaluation on cases with coexisting pathologies, and direct comparison against clinical specialists to establish non-inferiority benchmarks for clinical implementation.
5. Conclusions
This study demonstrates that clinical utility-driven ML enhances retinal OCT classification by harmonizing technical performance with diagnostic priorities. Logistic Regression emerged as the optimal model, balancing high accuracy and discriminative power while reducing false-negative rates through SMOTE-optimized data stratification. The framework’s emphasis on NPV addresses critical clinical needs in sight-threatening conditions like AMD and DME, where delayed diagnosis risks irreversible vision loss. Learning curves and ROC validation confirmed stable generalization, with SMOTE reducing the training–validation accuracy gap. By integrating anatomical feature preservation, automated model selection, and translational metrics, this approach standardizes OCT analysis. The results underscore the viability of ML-driven OCT diagnostics in reducing interpretative variability, accelerating pathological biomarker detection, and improving clinical decision-making, paving the way for deployable, clinically relevant AI tools in ophthalmic practice. Future work should focus on multicenter validation and real-time integration with clinical workflows to maximize patient-centered impact.
Cross-validation on pathologies absent from training validated the framework’s adaptability to novel retinal conditions, supporting its clinical viability in detecting unforeseen pathologies through shared microstructural signatures. This mitigates a key adoption barrier in ophthalmology, where exhaustive training on rare diseases is impractical.