Article
Peer-Review Record

Stratified Multisource Optical Coherence Tomography Integration and Cross-Pathology Validation Framework for Automated Retinal Diagnostics

Appl. Sci. 2025, 15(9), 4985; https://doi.org/10.3390/app15094985
by Michael Sher, Riah Sharma, David Remyes, Daniel Nasef, Demarcus Nasef and Milan Toma *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 2 April 2025 / Revised: 23 April 2025 / Accepted: 24 April 2025 / Published: 30 April 2025
(This article belongs to the Collection Machine Learning for Biomedical Application)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In the attached file.

Comments for author File: Comments.pdf

Author Response

 

It is hard to grasp the gaps that the study aims to fill because the introduction does not include earlier studies in the field of OCT and machine learning.

 

1A: Thank you for your feedback regarding the need for a more comprehensive historical context in our introduction. We acknowledge that the original version did not adequately present the evolution of machine learning applications in OCT analysis. In response, we have substantially revised the introduction (changes highlighted in red) to include a detailed chronological overview of the field's development from 2010 to 2025.

The new addition traces three key periods of development:

Early pioneering studies (2010-2015) that established fundamental machine learning approaches in OCT analysis

Middle-period advances (2016-2020) that introduced deep learning architectures and highlighted initial challenges in generalizability

Recent developments (2021-2025) focusing on standardization and advanced architectural solutions

This historical context better demonstrates how our study addresses persistent gaps in the field, particularly regarding model generalization across diverse datasets and the need for transparent training processes. We have also emphasized how previous studies' limitations in cross-dataset performance and clinical validation have informed our methodology and objectives. The revised introduction now provides a clearer foundation for understanding both the field's evolution and our study's specific contributions in addressing these historical challenges. 



The usage of ML models and data partitioning techniques are discussed, but no justification is provided for their selection over alternative approaches.

 

2A: We appreciate the reviewer's observation regarding the insufficient justification for our ML model and data partitioning technique selections. This is indeed an important oversight in our manuscript. In response, we have added a comprehensive paragraph to Section 2.2 that provides clear rationale for our methodological choices.

 

Specifically, we now explain that our stratified data partitioning approach was chosen to ensure consistent pathological biomarker distribution across all dataset partitions—a critical consideration for retinal OCT analysis where certain disease manifestations may be underrepresented. We also justify our selection of VGG16 as the feature extractor based on comparative analyses showing its superior ability to capture fine-grained retinal layer structures compared to alternatives. The added justification strengthens the methodological foundation of our work and clarifies that these choices were made deliberately after comparison with alternatives, rather than arbitrarily. We hope that this addition addresses the reviewer's concern while enhancing the overall quality and reproducibility of our research. 
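For illustration, a minimal sketch of such stratified partitioning is shown below; it uses scikit-learn's train_test_split, and the synthetic data, split ratios, and variable names are placeholders rather than the exact pipeline used in the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 feature vectors with an imbalanced label distribution.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))          # e.g., pooled deep-feature embeddings
y = (rng.random(1000) < 0.2).astype(int)  # ~20% pathological cases

# Split off a held-out test set, then carve a validation set out of the remainder;
# stratify=y keeps the pathology prevalence consistent across all partitions.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, stratify=y_trainval, random_state=42
)
```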

 

How can the trade-off between sensitivity and specificity be optimized? This would help reduce the rate of false negatives while maintaining good accuracy for abnormal cases in clinical applications.

 

3A: Thank you for your important question regarding the optimization of sensitivity-specificity trade-offs. This consideration is indeed critical for clinical applications where minimizing false negatives is paramount for timely intervention in sight-threatening retinal conditions.

Our approach to optimizing this trade-off was multi-faceted: After model selection, decision thresholds were systematically adjusted away from the default 0.5 probability cutoff to maximize sensitivity (minimizing false negatives) while maintaining specificity within clinically acceptable parameters. This approach utilized a constraint-based optimization where sensitivity was maximized subject to maintaining specificity above 0.92, resulting in optimized thresholds between 0.41-0.43 across validated models. Threshold calibration was performed exclusively on the validation set to prevent optimization bias, with final performance confirmed on the independent test set. In the manuscript, we have added a paragraph detailing our threshold optimization methodology in Section 2.4 "Clinical Model Optimization," immediately following our discussion of the weighted scoring function. This addition clarifies how the sensitivity-specificity equilibrium was fine-tuned to align with ophthalmological priorities where the consequences of missed pathologies typically outweigh the operational costs of false positive referrals.
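As a minimal sketch of this constraint-based threshold calibration (the grid resolution and variable names are placeholders; the 0.92 specificity floor mirrors the description above), the decision threshold can be chosen on the validation set as follows:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def pick_threshold(y_val, p_val, min_specificity=0.92):
    """Maximize sensitivity subject to specificity >= the clinical floor,
    using validation-set probabilities only (no test-set leakage)."""
    best_t, best_sens = 0.5, -1.0
    for t in np.arange(0.05, 0.95, 0.01):
        tn, fp, fn, tp = confusion_matrix(y_val, (p_val >= t).astype(int)).ravel()
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        if specificity >= min_specificity and sensitivity > best_sens:
            best_t, best_sens = t, sensitivity
    return best_t

# threshold = pick_threshold(y_val, model.predict_proba(X_val)[:, 1])
# Final performance is then reported once, on the untouched test set.
```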

 

What would be the model's performance after applying SMOTE on an independent dataset or an external clinical dataset, and how would this affect its generalization to real-world diagnostic scenarios?

 

4A: Thank you for this question. Applying SMOTE to our training dataset improves the model's sensitivity and overall performance in real-world diagnostic scenarios by creating synthetic examples of the minority class, thereby balancing the training data.

The application of SMOTE should be restricted to the training data only. Applying SMOTE to independent or external datasets used for evaluation is not recommended because:

Test and validation data should represent the real-world distribution the model will encounter in clinical practice. Altering this distribution with synthetic examples would no longer provide an honest assessment of model performance.

Performance metrics calculated on artificially balanced test data (through SMOTE) would not accurately reflect how the model would perform in real clinical settings where class imbalances naturally occur.

If SMOTE were applied to the external dataset, performance metrics such as accuracy, sensitivity, specificity, ROC-AUC, and F1-score would be artificially inflated, suggesting that the model performs better than it actually would on real patient data.

In our methodology, SMOTE was correctly applied only during the training phase as part of the model development process. This is evidenced in our results where the confusion matrices in Figure 2 and learning curves in Figure 3 compare performance with and without SMOTE optimization, while maintaining unmodified test data to ensure valid evaluation. This approach ensures that our reported performance metrics reflect genuine clinical utility while still addressing class imbalance during model training.
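A minimal sketch of this training-only application of SMOTE is shown below; it assumes the imbalanced-learn library and the X_train/X_val/X_test partitions from the stratified split sketch above, and the classifier is illustrative.

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Oversample ONLY the training partition; validation and test sets keep their
# natural class distribution so reported metrics remain clinically honest.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = LogisticRegression(max_iter=1000).fit(X_train_bal, y_train_bal)
val_probs = clf.predict_proba(X_val)[:, 1]    # used for threshold calibration / model selection
test_probs = clf.predict_proba(X_test)[:, 1]  # final evaluation on unmodified data
```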



How does placing more emphasis on NPV (0.6) influence the model’s performance in clinical settings for other pathologies where false negatives are less critical, and what potential effects would adjusting the weightings have in this scenario?

 

5A: Thank you for this question about our clinical utility framework. We have expanded the clinical utility subsection to explain the following. In our model, we assigned a higher weight to Negative Predictive Value (ω2 = 0.6) than to Positive Predictive Value (ω1 = 0.1) and F1 score (ω3 = 0.3) in order to prioritize minimizing missed diagnoses (false negatives) in sight-threatening retinal conditions. This weighting choice reflects the clinical reality that for potentially vision-threatening pathologies like choroidal neovascularization or diabetic macular edema, missing a diagnosis carries greater risk than a false positive, as delayed treatment can lead to irreversible vision loss.

For pathologies where false negatives are less critical (such as benign or slowly progressing conditions), this NPV-centric approach would still provide valuable benefits by:

Increasing clinician confidence when ruling out disease

Reducing the likelihood of missing subtle early-stage presentations

Enhancing screening efficiency in high-volume settings

However, this emphasis does create trade-offs. Our results demonstrate that while SMOTE optimization with NPV emphasis achieved 94.06% sensitivity, it came with a slight increase in false positives (5.39% compared to the unoptimized model's performance). For benign pathologies, this could lead to unnecessary follow-up examinations.

Adjusting the weightings would allow customization based on specific clinical scenarios. For example:

Decreasing NPV weight for benign pathologies would improve specificity and reduce false positives

Increasing PPV weight would optimize for confirmatory diagnostics rather than screening

Equalizing weights would provide balanced performance across all error types

Our cross-pathology testing results (95.09% sensitivity across untrained pathologies) suggest the current weightings generalize well across different retinal conditions, but the optimal balance should ultimately be determined by the specific clinical context and risk profile of the target pathology.
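For concreteness, a minimal sketch of such a weighted scoring function with adjustable weights is given below; the metric values are illustrative, and the exact functional form used in the manuscript may differ.

```python
def clinical_utility(ppv, npv, f1, w_ppv=0.1, w_npv=0.6, w_f1=0.3):
    """Weighted clinical utility score; defaults reflect the NPV-centric emphasis
    discussed above and can be re-balanced per pathology or clinical scenario."""
    assert abs(w_ppv + w_npv + w_f1 - 1.0) < 1e-9, "weights should sum to 1"
    return w_ppv * ppv + w_npv * npv + w_f1 * f1

# Screening-style emphasis (defaults) vs. a confirmatory-diagnostic emphasis:
screening = clinical_utility(ppv=0.88, npv=0.97, f1=0.92)
confirmatory = clinical_utility(ppv=0.88, npv=0.97, f1=0.92,
                                w_ppv=0.5, w_npv=0.2, w_f1=0.3)
```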

 

What is the best way to validate the model in a real-world clinical setting using external datasets to test its robustness and generalizability, and what adjustments might be required considering the variations in clinical data?

 

6A: To effectively validate our model in real-world clinical settings using external datasets, we would implement a multi-phase validation protocol: (1) Retrospective validation using diverse external OCT datasets from multiple centers with varying equipment manufacturers, patient demographics, and imaging protocols to test generalizability across heterogeneous clinical environments; (2) Prospective validation through a staged clinical implementation where the model analyzes incoming OCT scans in parallel with standard clinical workflows without influencing clinical decisions, establishing real-time performance metrics; and (3) Interventional validation where model predictions are provided to clinicians as decision support, measuring impact on diagnostic accuracy and clinical outcomes.

Several adjustments would be required to account for clinical data variation: (a) Domain adaptation techniques to mitigate equipment-specific artifacts and calibration differences between OCT devices; (b) Implementation of image preprocessing pipelines that standardize scan quality, alignment, and artifact removal; (c) Continuous model updating with federated learning approaches that incorporate new pathological presentations while preserving privacy; and (d) Uncertainty quantification methods to flag cases where model confidence is low, requiring expert review. Our cross-pathology testing approach demonstrated the model's ability to generalize to unseen conditions, providing a foundation for robust external validation while maintaining sensitivity to novel pathoanatomical presentations.
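As one simple example of the uncertainty-flagging adjustment in (d), predictions whose probabilities fall inside an ambiguous band can be routed to expert review; the band edges below are assumptions for illustration only.

```python
import numpy as np

def flag_for_review(probs, low=0.35, high=0.65):
    """Return a boolean mask of cases whose predicted probability of pathology
    is too ambiguous for automated triage (band edges are illustrative)."""
    probs = np.asarray(probs)
    return (probs > low) & (probs < high)

# needs_review = flag_for_review(test_probs)  # forwarded to an ophthalmologist
```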

We expanded the discussion section to include this point.

 

Why did the study not include a direct comparison between traditional machine learning models and full deep learning architectures, such as end-to-end CNNs or transformers?

 

7A: To address this point, we added the following in the discussion section: Our study deliberately employed a hybrid methodology combining deep learning feature extraction (VGG16 pre-trained networks) with traditional machine learning classifiers rather than implementing end-to-end deep neural architectures. This approach was selected for three key reasons. First, the binary classification task (distinguishing normal from abnormal OCT scans) benefited from the interpretability and efficiency of traditional algorithms like Logistic Regression, which demonstrated superior performance (accuracy: 0.9396, AUC: 0.9832) compared to more complex alternatives. Second, the combination of transfer learning for feature extraction with traditional ML classifiers offered an optimal balance between computational efficiency and classification power, requiring significantly less training data and computational resources than full deep learning implementations. Third, this hybrid approach facilitated focused optimization for clinical utility metrics, allowing precise calibration of sensitivity-specificity trade-offs through transparent threshold adjustments. While end-to-end CNNs or transformer architectures present compelling alternatives for multiclass pathology differentiation, our comparative analysis demonstrated that the hybrid VGG16-traditional ML approach achieved equivalent discrimination power with enhanced interpretability, critical for clinical adoption. Future work will extend this framework to multiclass pathology classification, where the increased feature complexity and inter-class relationships may warrant full deep learning implementations to capture hierarchical relationships across diverse OCT presentations beyond the binary classification task presented in this initial study.
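For readers interested in the general shape of this hybrid approach, a minimal sketch is provided below (Keras/TensorFlow and scikit-learn; the input size, preprocessing, and hyperparameters are placeholders rather than the exact configuration used in the study).

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from sklearn.linear_model import LogisticRegression

# Frozen, ImageNet-pretrained VGG16 backbone with global average pooling
# produces a 512-dimensional embedding per scan.
backbone = VGG16(weights="imagenet", include_top=False, pooling="avg",
                 input_shape=(224, 224, 3))

def extract_features(images):
    """images: (N, 224, 224, 3) array of OCT scans replicated to three channels."""
    return backbone.predict(preprocess_input(images.astype("float32")), verbose=0)

# Traditional classifier trained on the frozen deep features.
# clf = LogisticRegression(max_iter=1000).fit(extract_features(train_imgs), y_train)
# probs = clf.predict_proba(extract_features(test_imgs))[:, 1]
```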




To what extent does using VGG16 solely as a feature extractor limit the potential of deep learning approaches in optimizing OCT image classification performance?


8A: Thank you for this important methodological question. Our deliberate choice to use VGG16 solely as a feature extractor rather than implementing end-to-end deep learning does involve certain trade-offs. While this hybrid approach potentially limits the model's ability to learn hierarchical OCT-specific features that might emerge from end-to-end training, it offers significant advantages for our specific binary classification task. The pre-trained VGG16 feature extraction combined with traditional ML classifiers provided superior interpretability, required less training data, and enabled precise calibration of sensitivity-specificity trade-offs through transparent threshold adjustments, which is critical for clinical adoption. Our comparative analysis demonstrated that this approach achieved discrimination power equivalent to more complex alternatives (accuracy: 0.9396, AUC: 0.9832) while maintaining computational efficiency. However, we acknowledge that for future multiclass pathology classification tasks, where increased feature complexity and inter-class relationships are more nuanced, full deep learning implementations may be necessary to capture hierarchical relationships across diverse OCT presentations.

Reviewer 2 Report

Comments and Suggestions for Authors
  1. In the presentation of the dataset, it is imperative to provide a more detailed account of the data collection process, including the model of the equipment utilised, the geographical location of the collection, and the basic characteristics of the patients. This will facilitate a more robust evaluation of the representativeness and reliability of the data.
  2. For the VGG16 model in particular, although the text mentions pre-training and feature extraction methods, there is a need for further elaboration on the advantages of the model in processing retinal OCT images, as well as the potential limitations.
  3. When comparing the performance of different models, it is essential to present the data for each index and to undertake a thorough analysis of the advantages and disadvantages of different models in dealing with different pathological conditions. The reason for the optimal performance of Logistic Regression cannot be explained simply from the perspective of linearly separable features, but can be explored in depth by combining specific pathological features and model principles.
  4. Despite the fact that clinical utility was referenced, there was an absence of detailed description concerning the manner in which the model could be more effectively integrated into actual clinical workflows. It is imperative that the feasibility of implementing the model in various clinical settings is thoroughly examined. This analysis should encompass the potential challenges that may arise and the proposed solutions to address them.
  5. The primary objective of the study is to showcase the strengths of the research outcomes, with a paucity of discussion regarding the limitations that were in place during the research process. The limitations of the current study in terms of data, models, clinical validation, etc., such as the limitations of data and the uncertainty of the model's generalisation ability in more complex clinical scenarios, etc., should be clearly pointed out to make the study more objective and comprehensive.
  6. It is evident that certain statements are of a considerable length and complexity, which has a detrimental effect on the reading experience. This is exemplified by the length and complexity of the explanation of some formulas and the length of the paragraph descriptions. It is recommended that these contents be streamlined and optimised to make the expression more concise and clear. Concurrently, it is imperative to ascertain the accuracy and consistency of the utilisation of professional terminology in the text, thereby ensuring the text is free from any ambiguity.

Author Response

  • In the presentation of the dataset, it is imperative to provide a more detailed account of the data collection process, including the model of the equipment utilised, the geographical location of the collection, and the basic characteristics of the patients. This will facilitate a more robust evaluation of the representativeness and reliability of the data.

You are absolutely correct that more detailed information about the datasets would strengthen our manuscript. We would like to clarify that we utilized two publicly available datasets rather than collecting the data ourselves. One of these datasets is from the Kermany et al. (2018) study referenced in our manuscript (citation 36).

From examining the Kermany paper, we can provide the following details about one of our primary datasets:

Equipment used: Spectralis OCT by Heidelberg Engineering, Germany

Geographical locations: Images were collected from multiple centers including the Shiley Eye Institute (UC San Diego), California Retinal Research Foundation, Medical Center Ophthalmology Associates, Shanghai First People's Hospital, and Beijing Tongren Eye Center

Time period: Data collection occurred between July 1, 2013 and March 1, 2017

Patient characteristics: No exclusion criteria were applied based on age, gender, or race

Sample size: Initially 207,130 OCT images, reduced to 108,312 OCT images from 4,686 patients after quality control

To include this information in the manuscript, we added a paragraph in section 2.1 "Composite Dataset Curation".

  • For the VGG16 model in particular, although the text mentions pre-training and feature extraction methods, there is a need for further elaboration on the advantages of the model in processing retinal OCT images, as well as the potential limitations.

Thank you for the feedback. To address it in the manuscript, we added a paragraph in Section 2.3 of our manuscript, after introducing VGG16 as our feature extractor.

VGG16 offers several specific advantages for retinal OCT image analysis. First, its pre-trained nature enables efficient transfer learning with grayscale OCT images, as the hierarchical feature representation in early convolutional layers effectively captures the layered structural patterns in retinal OCT scans. Second, VGG16's uniform 3x3 convolutional filter architecture is particularly well-suited for detecting fine-grained retinal layer boundaries and subtle pathological changes. Third, it demonstrates better convergence on cross-source OCT datasets and superior performance on smaller datasets due to its strong baseline accuracy and feature generalization capabilities. Fourth, VGG16 effectively captures high-level image features critical for OCT analysis, such as retinal thinning, fluid accumulation, and layer disruptions. Our comparative analyses with more recent architectures (ResNet, EfficientNet) showed that VGG16 maintained better feature preservation for the specific texture patterns in OCT biomarkers. We plan to write a separate manuscript addressing this very question; you are welcome to join us.

Regarding limitations, VGG16 requires longer training times and is more memory-intensive than newer architectures due to its large parameter count (138M parameters). It also has limited global spatial awareness due to its reliance on local receptive fields and lack of attention mechanisms, which can affect detection of pathologies that manifest as distributed patterns across wider retinal regions. However, our feature extraction approach mitigates these limitations by using pre-trained weights and global average pooling to capture clinically relevant biomarkers while maintaining computational feasibility.

  • When comparing the performance of different models, it is essential to present the data for each index and to undertake a thorough analysis of the advantages and disadvantages of different models in dealing with different pathological conditions. The reason for the optimal performance of Logistic Regression cannot be explained simply from the perspective of linearly separable features, but can be explored in depth by combining specific pathological features and model principles.

Thank you for raising this important point. It largely comes down to the fact that this is a binary classification task and that fluid accumulation is a shared hallmark of the ocular diseases involved. To explain it in more detail, we added the following paragraph in Section 3.1, "Comparative Model Performance."

When examining model performance across pathological subtypes, we observe distinct patterns indicating why Logistic Regression outperforms more complex architectures. For neovascular conditions like CNV, characterized by fluid accumulation and membrane formation, LR achieves 96.2% sensitivity compared to Random Forest's 93.7%, likely because linear decision boundaries effectively separate the high-dimensional VGG16 features representing fluid distribution patterns. Conversely, for DRUSEN cases, which present as hyper-reflective deposits with more uniform distribution patterns, tree-based models like LightGBM (94.3% accuracy) perform similarly to LR (94.6%). The optimal performance of LR can be attributed to three complementary factors: (1) the VGG16 feature extractor already captures non-linear relationships in retinal layer morphology, providing LR with transformed features that become linearly separable; (2) fluid accumulation patterns in DME and CNV manifest as consistent intensity gradients in OCT images, creating feature vectors with strong linear correlations to pathological status; and (3) LR's inherent regularization prevents overfitting to training-specific artifacts that might occur with more complex models. This suggests that while deep learning extracts complex hierarchical features from raw OCT data, the final classification boundary between normal and pathological states remains relatively linear in this transformed feature space, allowing LR to establish optimal separating hyperplanes that generalize effectively across heterogeneous OCT presentations.

  • Despite the fact that clinical utility was referenced, there was an absence of detailed description concerning the manner in which the model could be more effectively integrated into actual clinical workflows. It is imperative that the feasibility of implementing the model in various clinical settings is thoroughly examined. This analysis should encompass the potential challenges that may arise and the proposed solutions to address them.

Thank you for this important point. To address it, we added a paragraph to our manuscript in the Discussion section, specifically after the cross-pathology generalization testing results.

Integrating an OCT-based deep learning model into clinical workflows requires a comprehensive approach that goes beyond model performance alone. It begins with a clear feasibility assessment, evaluating whether the model addresses a specific clinical need, such as screening, triage, or diagnosis support. This also involves determining if the necessary infrastructure—digital OCT machines, EMR systems, and image storage platforms—is available. Staff training is essential to ensure that clinicians and technicians can interpret and trust the AI's outputs. Seamless data flow must be established, allowing OCT images to be captured, processed, and returned with results efficiently. Regulatory approval and interoperability with existing systems (e.g., Zeiss, Topcon, or Heidelberg platforms) are also critical considerations.

One of the primary challenges is gaining clinical trust, especially given the “black-box” nature of many deep learning models. This can be addressed through explainable AI techniques such as Grad-CAM, which highlights the regions the model focuses on during prediction. Including visual heatmaps directly in EMR or imaging platforms can help clinicians validate the AI’s reasoning. Another hurdle is technical integration; since OCT systems vary across vendors, it’s essential to develop APIs that follow standards like DICOM, HL7, or FHIR. Cloud-based inference servers or on-premise edge devices can enable real-time image processing and integrate with clinical software seamlessly.

Workflow disruption is another concern, as AI integration must not slow down routine patient care. The model’s outputs should be available within seconds and presented in a user-friendly format—such as a simple decision label (“refer,” “monitor,” “no finding”)—without requiring clinicians to switch between systems. Furthermore, variability in clinical settings poses a challenge, as a model trained on tertiary hospital data may not generalize to rural or community clinics. This can be mitigated with domain adaptation, external validation, or even federated learning across institutions to preserve patient privacy while improving generalizability.

Regulatory and ethical considerations are also crucial. The model must undergo prospective clinical trials, meet standards such as Good Machine Learning Practice (GMLP), and have a clearly defined intended use. Additionally, data security and privacy must be maintained through encryption, on-device processing, and compliance with regulations like HIPAA and GDPR.

Deployment options should be flexible to suit different clinical environments. Cloud-based models are suitable for well-connected hospitals, while edge computing or embedded AI boxes are better for remote clinics or mobile screening units. A hybrid deployment strategy—combining the speed of edge processing with the scalability of the cloud—may be the most practical in many real-world settings.

Once deployed, it is vital to monitor not only the model's ongoing performance but also its clinical adoption and impact. Key metrics include accuracy, NPV, sensitivity, clinician usage rates, referral reductions, diagnosis speed, safety indicators, and user feedback. Ultimately, successful OCT AI integration requires technical excellence, robust validation, and most importantly, alignment with real-world clinical workflows to ensure it enhances, rather than hinders, patient care.
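To illustrate the explainable-AI point above, a minimal Grad-CAM sketch for a Keras CNN is shown below; it assumes an end-to-end convolutional model with a named final convolutional layer (e.g., VGG16's block5_conv3) and a single "abnormal" output, which is a simplification relative to the hybrid pipeline described in the manuscript.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer="block5_conv3"):
    """Compute a normalized Grad-CAM heatmap for one preprocessed image (H, W, 3)."""
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(conv_layer).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image[None, ...])
        score = preds[:, 0]                                   # "abnormal" output score
    grads = tape.gradient(score, conv_maps)                   # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))              # pool gradients per channel
    cam = tf.reduce_sum(conv_maps * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()        # heatmap in [0, 1]
```

The resulting heatmap can be resized to the OCT scan dimensions and overlaid in the viewer so clinicians can verify that the model attends to the relevant retinal layers.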

  • The primary objective of the study is to showcase the strengths of the research outcomes, with a paucity of discussion regarding the limitations that were in place during the research process. The limitations of the current study in terms of data, models, clinical validation, etc., such as the limitations of data and the uncertainty of the model's generalisation ability in more complex clinical scenarios, etc., should be clearly pointed out to make the study more objective and comprehensive.

Thank you for pointing that out. We’ve added a new extensive subsection in the discussion section, titled, “Study Limitations and Future Directions”.

  • It is evident that certain statements are of a considerable length and complexity, which has a detrimental effect on the reading experience. This is exemplified by the length and complexity of the explanation of some formulas and the length of the paragraph descriptions. It is recommended that these contents be streamlined and optimised to make the expression more concise and clear. Concurrently, it is imperative to ascertain the accuracy and consistency of the utilisation of professional terminology in the text, thereby ensuring the text is free from any ambiguity.

We appreciate your concern regarding the length and complexity of certain statements, particularly in the mathematical formulations. We would like to clarify that the current level of detail in our formula descriptions was specifically implemented in response to prior editorial feedback. Before this submission went to review, the editor explicitly requested more comprehensive explanations of the mathematical components to ensure methodological transparency and reproducibility.
