Efficient Ensemble Learning with Curriculum-Based Masked Autoencoders for Retinal OCT Classification
Abstract
1. Introduction
- We introduce CurriMAE-Soup, which combines multiple pretrained snapshots into a single model using the model soup technique, eliminating the need for multiple fine-tuning processes and enabling efficient single-model inference.
- We propose CurriMAE-Greedy, which ensembles only the top-performing fine-tuned snapshots (two or three models), reducing storage and inference costs compared with full snapshot ensembles while maintaining high performance; both strategies are sketched in the code examples after this list.
- We conduct experiments on retinal OCT classification, demonstrating that both CurriMAE-Soup and CurriMAE-Greedy achieve performance competitive with or superior to other methods.
- We analyze the trade-offs between performance and computational efficiency across the ensemble strategies, providing practical insights for deploying SSL models in clinical applications.
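To make the parameter-averaging idea concrete, below is a minimal PyTorch-style sketch of a uniform model soup. It is an illustration under stated assumptions, not the authors' implementation: `checkpoint_paths` is a hypothetical list of fine-tuned snapshot checkpoints, and non-floating-point buffers may need special handling in practice.

```python
import copy
import torch

def uniform_soup(state_dicts):
    """Average the parameters of several fine-tuned snapshots into a single
    set of weights; inference then costs one forward pass, unlike an
    output-averaging ensemble that must run every member."""
    soup = copy.deepcopy(state_dicts[0])
    for name in soup:
        # Stack the matching tensor from every snapshot and take the mean.
        # Integer buffers (e.g., BatchNorm counters) become float here and
        # may need to be cast back or copied from a single snapshot instead.
        soup[name] = torch.stack(
            [sd[name].float() for sd in state_dicts]
        ).mean(dim=0)
    return soup

# Hypothetical usage: merge fine-tuned snapshot checkpoints into one model.
# state_dicts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
# model.load_state_dict(uniform_soup(state_dicts))
```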
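Likewise, a sketch of greedy snapshot selection, assuming an output-averaging ensemble scored on a held-out validation loader; the helper names (`ensemble_accuracy`, `greedy_select`) and the accuracy criterion are illustrative choices and may differ from the paper's exact procedure.

```python
import torch

@torch.no_grad()
def ensemble_accuracy(models, val_loader):
    """Validation accuracy of an ensemble that averages softmax outputs.
    Assumes every model is already in eval mode and on the same device."""
    correct = total = 0
    for x, y in val_loader:
        probs = torch.stack([m(x).softmax(dim=-1) for m in models]).mean(dim=0)
        correct += (probs.argmax(dim=-1) == y).sum().item()
        total += y.numel()
    return correct / total

def greedy_select(models, val_loader, max_size=3):
    """Start from the best single snapshot, then repeatedly add whichever
    remaining snapshot improves validation accuracy the most, stopping at
    `max_size` members (two or three in CurriMAE-Greedy) or when no
    candidate helps."""
    remaining = sorted(models, reverse=True,
                       key=lambda m: ensemble_accuracy([m], val_loader))
    chosen = [remaining.pop(0)]
    best = ensemble_accuracy(chosen, val_loader)
    while remaining and len(chosen) < max_size:
        scores = [ensemble_accuracy(chosen + [m], val_loader) for m in remaining]
        i = max(range(len(scores)), key=scores.__getitem__)
        if scores[i] <= best:
            break  # no remaining snapshot improves the ensemble
        best = scores[i]
        chosen.append(remaining.pop(i))
    return chosen
```

The selected members would then be combined at inference time by averaging their softmax outputs, i.e., the simple-averaging ensemble of Section 3.3.1 applied to a pruned snapshot set.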
2. Related Works
2.1. OCT Disease Classification with Deep Learning
2.2. Masked Image Modeling for OCT
2.3. Ensemble Learning and Efficient Model Aggregation Strategies
3. Materials and Methods
3.1. OCT Datasets
3.2. Pretraining: Progressive Masking with MAE via Curriculum Learning
3.3. Fine-Tuning: Downstream Classification with Three Ensemble Strategies
3.3.1. Simple Averaging Ensemble—CurriMAE
3.3.2. Parameter Averaging—CurriMAE-Soup
3.3.3. Greedy Snapshot Selection—CurriMAE-Greedy
3.4. Computational Cost and Resource Usage
4. Results
4.1. Comparison of Supervised and Self-Supervised Baselines with CurriMAE
4.2. Fixed vs. Adaptive Epoch Scheduling in CurriMAE
4.3. Performance Comparison: CurriMAE Ensembles vs. Individually Fine-Tuned Snapshots
4.4. Ablation Study
4.4.1. Effect of Curriculum-Based Progressive Masking
4.4.2. Effect of Snapshot-Based Ensemble Strategies
4.4.3. Effect of Greedy Model Selection and Ensemble Size
4.4.4. Effect of Epoch Scheduling Strategy
5. Discussion
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| ACC | Accuracy |
| AMD | Age-related macular degeneration |
| AUC | Area under the receiver operating characteristic curve |
| AUPRC | Area under the precision–recall curve |
| BERT | Bidirectional encoder representations from transformers |
| CNN | Convolutional neural networks |
| CNV | Choroidal neovascularization |
| CurriMAE | Curriculum learning-based masked autoencoders |
| DME | Diabetic macular edema |
| ERM | Epiretinal membrane |
| FLOPs | Floating point operations |
| F1 | F1-score |
| MAE | Masked autoencoders |
| MIM | Masked image modeling |
| MSE | Mean squared error |
| NOR | Normal |
| OCT | Optical coherence tomography |
| PRE | Precision |
| RAO | Retinal artery occlusion |
| RVO | Retinal vein occlusion |
| SEN | Sensitivity |
| Snap-MAE | Snapshot ensemble learning-based masked autoencoders |
| SSL | Self-supervised learning |
| VID | Vitreomacular interface disease |
| ViT | Vision transformer |
Table: Composition of the OCT datasets; values are numbers of images per split.

| Phase | Dataset | Class Label † | Pretraining | Training | Validation * | Test |
|---|---|---|---|---|---|---|
| Pretraining | Kermany | - | 108,309 | - | - | - |
| Fine-tuning | OCTDL (7 classes, total = 2064) | AMD | - | 796 | 199 | 236 |
| | | NOR | - | 214 | 53 | 65 |
| | | ERM | - | 100 | 25 | 30 |
| | | DME | - | 94 | 24 | 29 |
| | | RVO | - | 65 | 16 | 20 |
| | | VID | - | 49 | 12 | 15 |
| | | RAO | - | 10 | 3 | 9 |
Table: Computational cost and resource usage across the pretraining, fine-tuning, and inference phases.

| Phase | Metric | MAE | CurriMAE | CurriMAE-Soup | CurriMAE-GE2 |
|---|---|---|---|---|---|
| Pretraining | Pretraining runs | * | 1 | 1 | 1 |
| | FLOPs per sample (G) | 1.78 | 1.78 | 1.78 | 1.78 |
| | Parameters (Millions) | 22.14 | 22.14 | 22.14 | 22.14 |
| | Model Size (MB) | 84.89 | 84.89 | 84.89 | 84.89 |
| | Max GPU Memory Usage (MB) | 1639.89 | 1639.89 | 1639.89 | 1639.89 |
| | Training Time per Epoch (min:sec) | 4:36 | 4:30 | 4:30 | 4:30 |
| Fine-tuning | Fine-tuning runs | + | 1 | | |
| | FLOPs per sample (G) | 6.44 | 6.44 | 6.44 | 6.44 |
| | Parameters (Millions) | 21.67 | 21.67 | 21.67 | 21.67 |
| | Model Size (MB) | 82.71 | 82.71 | 82.71 | 82.71 |
| | Max GPU Memory Usage (MB) | 577.78 | 577.78 | 577.78 | 577.78 |
| | Total Training Time (min:sec) | 1:47 | 1:49 | 1:48 | 1:48 |
| Inference | Inference models required | 1 | 2 | | |
| | Inference FLOPs per sample (G) | 6.44 | 6.44 | 6.44 | 6.44 |
Table: Classification performance of supervised and self-supervised baselines compared with CurriMAE and its ensemble variants on OCTDL.

| Models | AUC | AUPRC | ACC (%) | SEN | PRE | F1 |
|---|---|---|---|---|---|---|
| ResNet-34 (FS) * | 0.980 (0.002) | 0.906 (0.009) | 85.56 (0.62) | 0.855 (0.006) | 0.860 (0.006) | 0.856 (0.005) |
| ViT-S (FS) | 0.907 (0.013) | 0.764 (0.014) | 71.20 (1.36) | 0.712 (0.014) | 0.625 (0.036) | 0.637 (0.028) |
| ResNet-34 (IN) + | 0.991 (0.003) | 0.950 (0.012) | 91.34 (2.62) | 0.913 (0.026) | 0.914 (0.026) | 0.912 (0.027) |
| ViT-S (IN) | 0.992 (0.001) | 0.953 (0.006) | 91.01 (1.00) | 0.910 (0.010) | 0.914 (0.008) | 0.911 (0.009) |
| ResNet-34 (OCT) ** | 0.989 (0.001) | 0.937 (0.004) | 88.20 (0.14) | 0.882 (0.002) | 0.881 (0.004) | 0.879 (0.003) |
| ViT-S (OCT) | 0.879 (0.021) | 0.718 (0.024) | 69.80 (1.28) | 0.698 (0.013) | 0.544 (0.030) | 0.606 (0.017) |
| MAE 60% (OCT) | 0.994 (0.001) | 0.955 (0.005) | 92.74 (1.03) | 0.928 (0.010) | 0.931 (0.010) | 0.929 (0.010) |
| MAE 70% (OCT) | 0.993 (0.002) | 0.953 (0.006) | 92.41 (0.94) | 0.924 (0.010) | 0.927 (0.007) | 0.925 (0.009) |
| MAE 80% (OCT) | 0.994 (0.001) | 0.943 (0.006) | 91.34 (0.66) | 0.913 (0.007) | 0.917 (0.006) | 0.912 (0.010) |
| MAE 90% (OCT) | 0.991 (0.002) | 0.938 (0.010) | 91.75 (1.74) | 0.906 (0.011) | 0.911 (0.012) | 0.916 (0.018) |
| CurriMAE (OCT) | 0.994 (0.001) | 0.956 (0.005) | 92.82 (0.00) | 0.928 (0.000) | 0.929 (0.001) | 0.928 (0.000) |
| CurriMAE-Soup (OCT) | 0.994 (0.002) | 0.951 (0.009) | 92.41 (0.62) | 0.924 (0.007) | 0.926 (0.005) | 0.924 (0.006) |
| CurriMAE-GE2 (OCT) | 0.995 (0.001) | 0.960 (0.004) | 93.32 (0.25) | 0.933 (0.003) | 0.934 (0.002) | 0.933 (0.002) |
| CurriMAE-GE3 (OCT) | 0.995 (0.001) | 0.959 (0.005) | 93.32 (0.50) | 0.933 (0.005) | 0.935 (0.005) | 0.933 (0.005) |

Table: Fixed versus adaptive epoch scheduling in CurriMAE and its ensemble variants.

| Scheduling | Models | AUC | AUPRC | ACC (%) | SEN | PRE | F1 |
|---|---|---|---|---|---|---|---|
| Fixed epochs | CurriMAE (OCT) | 0.994 (0.001) | 0.956 (0.005) | 92.82 (0.00) | 0.928 (0.000) | 0.929 (0.001) | 0.928 (0.000) |
| | CurriMAE-Soup (OCT) | 0.994 (0.002) | 0.951 (0.009) | 92.41 (0.62) | 0.924 (0.007) | 0.926 (0.005) | 0.924 (0.006) |
| | CurriMAE-GE2 (OCT) | 0.995 (0.001) | 0.960 (0.004) | 93.32 (0.25) | 0.933 (0.003) | 0.934 (0.002) | 0.933 (0.002) |
| | CurriMAE-GE3 (OCT) | 0.995 (0.001) | 0.959 (0.005) | 93.32 (0.50) | 0.933 (0.005) | 0.935 (0.005) | 0.933 (0.005) |
| Adaptive epochs | CurriMAE (OCT) | 0.993 (0.001) | 0.952 (0.002) | 92.90 (0.14) | 0.929 (0.002) | 0.930 (0.001) | 0.929 (0.001) |
| | CurriMAE-Soup (OCT) | 0.993 (0.002) | 0.948 (0.008) | 91.58 (0.25) | 0.916 (0.003) | 0.919 (0.003) | 0.916 (0.002) |
| | CurriMAE-GE2 (OCT) | 0.995 (0.001) | 0.956 (0.001) | 93.23 (0.57) | 0.933 (0.006) | 0.933 (0.006) | 0.932 (0.006) |
| | CurriMAE-GE3 (OCT) | 0.995 (0.001) | 0.956 (0.002) | 93.15 (0.52) | 0.932 (0.005) | 0.932 (0.007) | 0.931 (0.006) |