Learning Dynamics Analysis: Assessing Generalization of Machine Learning Models for Optical Coherence Tomography Multiclass Classification
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper has shown how learning dynamics analysis can be used to assess the generalisation of machine learning models in healthcare settings. The authors do a good job of conveying the importance and necessity of assessing the generalisation of machine learning models by testing with new data that was not part of the training dataset, especially for clinical interpretation.
The introduction is detailed and the scientific need of the study is conveyed clearly.
The development of the machine learning algorithms and method of analysis are detailed clearly in the method section.
One area of improvement identified is the discussion section. Although the authors have rightly pointed out the limitations of cross-study comparison, it would be good if some comparisons were still provided notwithstanding the limitations raised. The lack of comparison with similar studies that have looked at the generalisation of machine learning models in clinical settings takes away from the scientific soundness of the study.
For additional comments, see the attached file.
Comments for author File: Comments.pdf
Author Response
Comment #1: Methodology Improvement: “The one area of improvement to suggest here is that the authors present an overview of the pipeline and development of other machine learning models tested.”
Response: Thank you for the constructive feedback. In response to your suggestion, we have completely rewritten Section 2.3 to provide the requested overview of the pipeline and development of the other machine learning models tested. This new section now details the AutoML screening with PyCaret, the architecture and training of the CNN models (both with and without augmentation), and provides context for the DNN approach, thereby addressing your comment in full. We appreciate your guidance in helping us improve the manuscript.
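For illustration only, a minimal sketch of what such a PyCaret screening step can look like is shown below; this is a simplified, hypothetical snippet (the file name, column names, and settings are assumptions), not the exact code from our pipeline.

```python
# Minimal sketch of an AutoML screening step with PyCaret, assuming the OCT
# images have already been reduced to tabular feature vectors stored in a
# DataFrame with a "label" column. File and column names are illustrative,
# not the study's actual pipeline.
import pandas as pd
from pycaret.classification import setup, compare_models, pull

features_df = pd.read_csv("oct_feature_vectors.csv")  # hypothetical feature table

# Initialize the PyCaret experiment (handles the internal train/validation split).
setup(data=features_df, target="label", session_id=42)

# Rank candidate classical ML models by cross-validated performance.
top_models = compare_models(n_select=3)
print(pull())  # leaderboard of the screened models
```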
Comment #2: Discussion Section Improvement: “The lack of comparison with other published studies reduces the scientific soundness of the study… there is still a need to show how the current research has gone beyond other published studies.” (Note: this comment primarily concerns Section 4.2.)
Response: Thank you for this suggestion. We have added to our paper several examples of how machine learning studies in this field present results in a way that often omits vital information about the progress of model training.
Changes in the manuscript: We have completely rewritten and expanded Section 4.2, as requested.
Reviewer 2 Report
Comments and Suggestions for Authors
The attached manuscript is a strong piece of work and a good article, but I have some points to raise. The idea of checking learning dynamics is very good and very important for clinical use, since usually everyone looks only at final accuracy. The use of public datasets is good for reproducibility. But why did the authors not use a more modern architecture than VGG16? Perhaps ResNet or EfficientNet could extract better features. Also, while the DNN generalizes well, its final accuracy on the external set is only 76%, which I think is not high enough for real clinical use. The authors say that class 4, Drusen, is a big problem with many errors, but they do not do much to fix this; perhaps more data for this class or special augmentation is needed. In addition, the augmented CNN has a suspiciously good result on the external data; the authors acknowledge this as a problem but do not investigate why it happens, which may need a deeper look. Also, all training is only 10 epochs, which may be too few; I am not sure the models converge fully. Overall, the work is a good direction, but, if possible, it needs deeper experiments and perhaps better model architectures, or at least commentary and analysis on these issues. These points should be addressed in the revision to make the manuscript more suitable.
Author Response
Comment #1: “But why did the authors not use a more modern architecture than VGG16?”
Response: We appreciate the reviewer's question regarding our choice of VGG16. We specifically selected VGG16 as a frozen feature extractor (Section 2.5) rather than for end-to-end training, which is an important distinction. VGG16 remains widely adopted in medical imaging transfer learning applications due to its straightforward architecture, computational efficiency, and well-established feature representations. Our study's primary contribution centers on the detailed evaluation framework: specifically, the analysis of learning dynamics, multi-split validation, and external generalization (Sections 2.8-2.10, Table 1). Using a well-validated feature extractor like VGG16 strengthens this methodology by providing stable, interpretable features while allowing us to focus on the DNN classifier's learning behavior and generalization capability.
We note that 'modern' architectures do not uniformly outperform established ones, particularly in transfer learning scenarios with medical imaging data. The choice of architecture should be driven by the specific research questions and practical constraints rather than novelty alone. Given that our models demonstrated healthy learning dynamics and robust external validation performance (Figures 5-7), VGG16 proved appropriate for this application.
Changes in the manuscript: We have expanded Section 2.3, Model Selection and Methodological Focus, to explain this reasoning.
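To make the frozen-feature-extractor design concrete, a minimal sketch is provided below, assuming a TensorFlow/Keras implementation; the input shape, head size, dropout rate, and number of classes are illustrative rather than the manuscript's exact configuration.

```python
# Minimal sketch of VGG16 used as a frozen feature extractor with a small DNN
# classifier head, assuming TensorFlow/Keras. Layer sizes, input shape, and
# num_classes are illustrative, not the manuscript's exact settings.
import tensorflow as tf

num_classes = 8  # illustrative; set to the actual number of OCT classes

# Pre-trained VGG16 without its classification head, weights frozen.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # VGG16 features are used as fixed representations

# Lightweight DNN classifier head trained on the pooled VGG16 embeddings.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```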
Comment #2: “Also, while the DNN generalizes well, its final accuracy on the external set is only 76%, which I think is not high enough for real clinical use.”
Response: Thank you for this comment and observation. The goal of our research was to determine which model would perform best on this multiclass classification task. Although the DNN model reached only 76% accuracy on the external dataset, it showed the healthiest learning dynamics of all the models compared, with the expected monotonic decrease in performance metrics from training through internal validation to external testing. Because other studies in the field largely do not include external datasets, it was difficult to compare our external accuracy with that of other models. Our CNN models come close to, and in some cases exceed, models reported in the field, but our analysis showed that they did not exhibit equally healthy training behavior and even had a lower F1 score on the external dataset than the DNN model. While we recognize that our models are not yet at the level of clinical applicability, we hope our research can serve as a foundation for proper training techniques and data reporting, leading to models with improved accuracy and a higher chance of clinical applicability in the future.
Changes in the manuscript: We have expanded Section 4.1 to reiterate the above point.
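For clarity, the monotonic pattern we refer to can be checked with a simple helper of the following kind; this is a hypothetical sketch (the function name, variable names, and split dictionary are assumptions), assuming a classifier whose predict() returns class labels.

```python
# Minimal sketch of the "healthy monotonic decrease" check described above:
# a well-generalizing model should score highest on training data, lower on
# internal validation, and lowest on the external dataset, rather than
# improving externally. Names here are illustrative.
from sklearn.metrics import accuracy_score, f1_score

def generalization_profile(model, splits):
    """splits: ordered mapping, e.g. {"train": (X, y), "validation": (X, y), "external": (X, y)}."""
    profile = {}
    for name, (X, y) in splits.items():
        preds = model.predict(X)  # assumes predict() returns class labels (scikit-learn style)
        profile[name] = {
            "accuracy": accuracy_score(y, preds),
            "macro_f1": f1_score(y, preds, average="macro"),
        }
    accs = [profile[name]["accuracy"] for name in splits]
    profile["monotonic_decrease"] = all(a >= b for a, b in zip(accs, accs[1:]))
    return profile

# Usage (hypothetical variable names):
# generalization_profile(dnn_clf, {"train": (X_tr, y_tr),
#                                  "validation": (X_va, y_va),
#                                  "external": (X_ex, y_ex)})
```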
Comment #3: “The authors say that class 4, Drusen, is a big problem with many errors, but they do not do much to fix this; perhaps more data for this class or special augmentation is needed.”
Response: Thank you for raising this concern about our results. Drusen misclassification is hard to fix because drusen can occur both as an isolated finding and as a diagnostic hallmark of other pathologies such as AMD and CNV. Drusen form when leaky vessels deposit proteins and fats under the retina, which can be seen in many diseases. The main difference between classifying an image as drusen alone versus a pathology such as AMD is largely based on the number of drusen present in an individual’s retina at the time of observation. We have also included this clarification in the paper to make certain that the reason for the drusen misclassification is explained.
Addition to Paper (changes after line 401, in Section 3.3.2): Thus, this misclassification reflects the underlying clinical relationship between drusen and these retinal pathologies. Drusen are extracellular deposits that accumulate between the retinal pigment epithelium and Bruch's membrane, and their presence exists on a clinicopathological spectrum. While a few small (hard) drusen are considered a normal aging change, the accumulation of larger and more numerous drusen represents the pathological hallmark of early age-related macular degeneration. Importantly, drusen are not merely associated with AMD; they are an integral component of the disease process itself, with large confluent drusen representing pathological signs rather than benign age-related changes. This creates a fundamental classification challenge: images labeled as “Drusen” (Class 4) often contain significant drusen burden that is clinically indistinguishable from early-stage AMD (Class 8), since substantial drusen accumulation is itself a manifestation of AMD pathology. Similarly, drusen presence substantially increases the risk of progression to choroidal neovascularization, the defining feature of wet AMD, creating feature overlap between Class 4 (Drusen) and Class 2 (CNV). The model's frequent confusion among these classes therefore reflects genuine clinical ambiguity in the continuum from normal aging to drusen accumulation to advanced macular degeneration, rather than pure classification error. OCT images showing prominent drusen may legitimately belong to multiple disease categories depending on additional clinical context not captured in the imaging alone, making this a boundary where even expert human graders would demonstrate inter-rater variability.
Comment #4: “The augmented CNN has a suspiciously good result on the external data; the authors acknowledge this as a problem but do not investigate why it happens.”
Response: We appreciate the reviewer's important observation regarding the augmented CNN's external validation performance. The reviewer is correct that our initial manuscript flagged this as problematic but did not investigate the underlying cause.
Changes in the manuscript: In response to this comment, we have added a detailed explanation in Section 4.1 that articulates why unexpected performance improvements on external data cannot be accepted at face value. The new paragraph explains that external datasets should represent the most challenging evaluation scenario due to differences in acquisition protocols, patient populations, and imaging characteristics. When performance paradoxically improves externally, this typically indicates that the external dataset inadvertently simplified the classification task through favorable class distributions, clearer image quality, or exclusion of diagnostically challenging cases. Such anomalies signal evaluation artifacts rather than genuine generalization capability, rendering the augmented CNN's superficially impressive external metrics unreliable indicators of clinical utility. This expanded discussion clarifies why non-monotonic performance patterns disqualify models from clinical consideration regardless of their numerical performance values.
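As an illustration of the kind of diagnostic this discussion implies, a simple class-distribution comparison between the internal and external test sets is sketched below; the code is hypothetical and not part of the manuscript's pipeline.

```python
# Minimal sketch of a class-distribution comparison between the internal test
# set and the external dataset, to check whether the external data happens to
# over-represent easier classes. Label arrays are illustrative.
import numpy as np

def class_distribution(labels, num_classes):
    counts = np.bincount(np.asarray(labels), minlength=num_classes)
    return counts / counts.sum()

def distribution_shift(internal_labels, external_labels, num_classes):
    p = class_distribution(internal_labels, num_classes)
    q = class_distribution(external_labels, num_classes)
    # Total variation distance: 0 means identical class mix, 1 means disjoint.
    return 0.5 * np.abs(p - q).sum(), p, q

# Usage (hypothetical label arrays):
# tvd, p_int, p_ext = distribution_shift(y_test_internal, y_test_external, num_classes=8)
# A large shift suggests the external metrics may reflect a changed task mix
# rather than genuine generalization.
```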
Comment #5: “Also, all training is only 10 epochs, which may be too few; I am not sure the models converge fully.”
Response: Thank you for raising the question regarding the selection of 10 training epochs. We appreciate the opportunity to clarify this methodological decision, which was based on rigorous empirical evaluation rather than arbitrary convention. Preliminary experiments demonstrated that validation accuracy consistently plateaued within 8-10 epochs for our specific dataset and DNN architecture, with negligible performance improvements beyond this point. This convergence pattern is clearly visible in our reported learning curves (Figure 5, lines 379-385), where both training and validation accuracy show smooth, parallel progression that stabilizes by epoch 8-9. Our DNN utilizes pre-trained VGG16 features as fixed representations (Section 2.5, lines 162-170), meaning only the classifier head undergoes training on high-level feature embeddings rather than learning convolutional filters from random initialization. This substantially simplifies the optimization landscape and accelerates convergence compared to end-to-end training approaches. Transfer learning architectures of this type typically achieve stable convergence within 5-15 epochs depending on dataset complexity. Training beyond observed convergence would risk overfitting, where the model begins memorizing training-specific patterns rather than learning generalizable features.
Changes in the manuscript: We have added detailed justification text in Section 2.7 (Training Methodology and Optimization) immediately following lines 186-187. Specifically, the new explanatory paragraph has been inserted between the sentence "The training loop was executed for 10 epochs, with each epoch consisting of the operations detailed in Algorithm 3." (lines 186-187) and the beginning of Algorithm 3 (line 188).
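For reference, one common way to verify such a convergence claim is to train with a generous epoch budget and an early-stopping callback, as sketched below under the assumption of a compiled Keras model and tf.data datasets; the function name, epoch budget, and patience are illustrative values, not the study's exact setup.

```python
# Minimal sketch of a convergence check: train with a generous epoch budget and
# EarlyStopping, then report the epoch at which validation accuracy peaked.
# Assumes a compiled Keras model and tf.data datasets; values are illustrative.
import numpy as np
import tensorflow as tf

def fit_until_plateau(model, train_ds, val_ds, max_epochs=30, patience=3):
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_accuracy",
        patience=patience,               # stop after `patience` epochs with no gain
        restore_best_weights=True,
    )
    history = model.fit(train_ds, validation_data=val_ds,
                        epochs=max_epochs, callbacks=[early_stop])
    best_epoch = int(np.argmax(history.history["val_accuracy"])) + 1
    print(f"Validation accuracy peaked at epoch {best_epoch}")
    return history
```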
