1. Introduction
Age-related macular degeneration (AMD) is a leading cause of severe visual impairment in people over 50 years of age [1]. It affects 196 million people worldwide, and this number is estimated to reach 288 million by 2040 [1]. Adequate treatment requires early diagnosis; however, the disease is initially asymptomatic, which makes periodic ophthalmologic examinations of the retinal fundus necessary [2]. These examinations are mainly based on the analysis of fundus images, since they are the most cost-effective imaging modality [3]. Fundus imaging allows detailed visualization of the retina’s macular region, where drusen deposits, pigmentary changes, and neovascularization (the hallmarks of AMD) typically appear. The affordability and ease of acquisition of these images make them an optimal screening tool for population-wide monitoring of AMD progression.
Given the high prevalence of AMD and the lack of specialists trained to diagnose it, automatic artificial intelligence algorithms have proven useful for the screening of the disease [3]. Such automated systems are particularly valuable in rural or under-resourced regions, where access to expert ophthalmological evaluation is limited. They can be integrated into teleophthalmology platforms, allowing retinal images acquired in primary care centers to be remotely graded by AI-based systems. These algorithms can also support large-scale screening campaigns, reduce diagnostic delays, and help close healthcare gaps by ensuring that patients with early-stage AMD are identified and referred for timely management. In this way, automated AMD grading contributes to improving access to eye care and reducing preventable vision loss on a population scale.
Automatic retinal image analysis based on convolutional neural networks (CNNs) and supervised learning (SL) has been widely explored in the recent literature. OverFeat [4,5], AlexNet [6], VGG-16 [7], ResNet50 [7,8], ResNet101 [3] and EfficientNetB4 [3] architectures have been individually explored. Tan et al. [9] developed a 14-layer ad hoc architecture as a fast, portable solution. A previous comparison among individual CNN architectures showed that the ResNetRS [10] architecture achieved the highest performance for AMD detection [11]. However, ensemble strategies are gaining special attention for AMD grading. For example, Grassmann et al. [12] combined AlexNet, VGG16, GoogLeNet, Inception-v3, ResNet101 and InceptionResNet-v2 to build a 13-class grading system. Govindaiah et al. [13] combined Inception-ResNet-V2 and Xception architectures to classify AMD into 4 levels. The DeepSeeNet model consists of 3 CNNs based on the Inception-v3 architecture and can classify 6 degrees of AMD [14]. It is also worth noting that the 5 best-ranked teams in the ADAM challenge proposed ensemble CNN strategies [3]. These ensemble strategies leverage the strengths of multiple models to provide a more robust and generalizable classification, which is especially important in a task with subtle inter-class differences, such as AMD grading. In real-world scenarios, ensemble models often outperform single-model approaches by mitigating biases, reducing variance, and improving overall confidence in predictions. This is particularly relevant in clinical diagnostics, where uncertainty may lead to inappropriate management.
Despite the success of CNNs, the emergence of the vision transformer (ViT) architecture has also revolutionized image classification models [15]. ViTs have demonstrated an exceptional ability to capture long-range dependencies and global context in images by replacing traditional convolutional layers with attention-based mechanisms. In this context, a foundation model for retinal images has recently been released under the name of RETFound [16]. This model is based on the ViT architecture and was trained with self-supervised learning (SSL) using unlabeled fundus images. RETFound has proven effective for the diagnosis of various retinal conditions, including diabetic retinopathy and glaucoma, via fine-tuning on task-specific retinal image datasets. In the original work, RETFound achieved areas under the ROC curve (AUROCs) of approximately 0.90–0.97 in ocular disease classification tasks, outperforming standard supervised networks pretrained on ImageNet. Its pretraining strategy allows it to extract meaningful features from massive amounts of data without requiring manual labeling, which is particularly useful in medical imaging scenarios where expert annotation is expensive and time-consuming. Moreover, RETFound represents a paradigm shift in the development of medical artificial intelligence (AI): the use of large-scale self-supervised pretraining tailored to domain-specific data, followed by fine-tuning for specific diagnostic tasks.
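As a rough illustration of this pretrain-then-fine-tune workflow (not the released RETFound code), the sketch below adapts a generic ViT backbone to a 4-class AMD grading head; the timm model name and checkpoint path are assumptions made for the example.

```python
# Hedged sketch of the SSL-pretrain / fine-tune paradigm described above,
# using a generic timm ViT as a stand-in for the foundation model.
import timm
import torch

# Build a ViT with a freshly initialized 4-class head (No/Early/Intermediate/Late AMD)
model = timm.create_model("vit_large_patch16_224", num_classes=4)

# Load self-supervised encoder weights; the file name is hypothetical
state = torch.load("retfound_fundus_encoder.pth", map_location="cpu")
model.load_state_dict(state, strict=False)  # keep encoder weights, skip the mismatched head

# Fine-tune end-to-end on the labeled AMD grading dataset
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.05)
```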
The main motivation of this work stems from the need for automated AMD grading systems that are both accurate and generalizable across diverse imaging conditions and acquisition devices. While existing CNN-based methods have achieved competitive performance, they often struggle to generalize due to limited representational diversity. Conversely, transformer-based models such as RETFound capture broader contextual information but may lack sensitivity to fine-grained retinal features. By integrating both architectures through an ensemble strategy, we aim to leverage their complementary strengths—achieving more robust and clinically reliable predictions. This hybrid approach not only improves performance metrics over individual models but also enhances interpretability and transferability to independent datasets, addressing key limitations identified in prior studies.
Preliminary experiments with individual architectures confirmed that ResNetRS outperformed other CNNs in accuracy, while RETFound achieved competitive generalization through its self-supervised pretraining. Given the individual success of both approaches (CNN-SL and ViT-SSL), we hypothesized that an ensemble of the two would achieve higher performance. This hypothesis is grounded in the expectation that the two types of models, trained with different learning paradigms and architectural principles, will offer diverse yet complementary feature representations. Our objective was to build an ensemble model combining a deep CNN and RETFound for the automatic grading of AMD using fundus images. The marked differences in architecture (CNN vs. ViT) and type of learning (SL vs. SSL) suggest that the individual models can provide complementary features for the classification task. Combining such heterogeneous models not only enhances performance but may also improve robustness across varied datasets and imaging conditions. To the best of our knowledge, these types of models have never been ensembled for retinal analysis. This work, therefore, contributes a novel hybrid strategy to the field of automated ophthalmologic diagnosis. By investigating the synergy between CNN-based and transformer-based feature extraction, we aim to bridge the gap between legacy vision models and emerging large-scale AI in medical imaging. The main contributions of this work can be summarized as follows:
- (1) We propose a novel ensemble model that combines a CNN architecture (ResNetRS) trained with SL and a transformer-based foundation model (RETFound) trained with SSL for AMD grading using fundus images.
- (2) To the best of our knowledge, this is the first study to combine RETFound with a CNN model in an ensemble framework, demonstrating their complementary feature representations.
- (3) The proposed approach was evaluated using the largest publicly available AMD dataset (AREDS) and externally validated on an independent clinical dataset, showing improved generalization and robustness.
- (4) The results outperform previously published methods in terms of quadratic weighted kappa (QWK), confirming the effectiveness of the ensemble strategy for ordinal AMD classification.
- (5) The study contributes a generalizable framework that can be extended to other retinal diseases and supports the development of clinically reliable AI-assisted diagnostic tools.
3. Results
Results for the individual models and the ensemble model were obtained in terms of accuracy (ACC), sensitivity (SE), precision (PR), F1-score and quadratic weighted kappa (QWK) using the test set of the AREDS dataset (15,003 images). The ensemble model was also evaluated on the test set of the private dataset (420 images). As shown in Table 1, the ResNetRS model surpasses the RETFound model, and the ensemble model performs better than both individual models. These results confirm that the ensemble strategy successfully integrates the strengths of the individual networks, leading to a more effective classification.
Although the ensemble model achieved only a modest improvement in accuracy compared to ResNetRS (+0.86%), it produced a higher QWK value (+0.008), which is particularly meaningful for ordinal classification tasks such as AMD grading. Moreover, the ensemble reduced variability between classes, increasing sensitivity for Intermediate and Late AMD while maintaining high precision. These results confirm that the ensemble provides complementary information that enhances the model’s clinical reliability beyond simple accuracy gains.
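For illustration, the snippet below is a minimal sketch of how the reported QWK values could be computed, using scikit-learn’s quadratically weighted Cohen’s kappa; the label arrays are placeholders, not the paper’s data.

```python
# Quadratic weighted kappa (QWK) for ordinal AMD grades,
# encoded as 0 = No AMD, 1 = Early, 2 = Intermediate, 3 = Late.
from sklearn.metrics import cohen_kappa_score

y_true = [0, 0, 1, 1, 2, 2, 3, 3]   # reference grades (example values only)
y_pred = [0, 0, 0, 1, 2, 3, 3, 3]   # model predictions (example values only)

qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"QWK = {qwk:.4f}")
```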
In Figure 2, we show the confusion matrix for the test set of AREDS using the proposed ensemble model. As observed, the class “Early AMD” is frequently misclassified as “No AMD”. This misclassification likely results from the subtle visual differences between early AMD and the healthy retina, which can be challenging even for human experts to detect. Drusen, for example, may be small and sparse in early stages, making them less visible depending on image quality or lighting. The other 3 classes are classified with notably better performance, as visually suggested by the darker diagonal entries in the confusion matrix for those categories.
To further understand the performance distribution, we analyzed the per-class metrics, as shown in Table 2. The ensemble model achieved the following SE values on the AREDS test set: 84.71% for ‘No AMD’, 28.37% for ‘Early AMD’, 70.61% for ‘Intermediate AMD’, and 76.38% for ‘Late AMD’. The particularly low sensitivity in the ‘Early AMD’ category confirms the confusion shown in the confusion matrix. On the other hand, the high sensitivity for ‘No AMD’ and ‘Late AMD’ supports the model’s capacity to distinguish between the extremes of the disease spectrum.
PR values followed a similar pattern (Table 2): the model was most precise in identifying ‘Late AMD’ and least precise in ‘Early AMD’. These results suggest that the ensemble approach is particularly well-suited for recognizing clear pathological changes but less reliable for subtle indicators of disease onset.
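As a brief illustration of how the per-class SE and PR values in Table 2 can be derived from a confusion matrix such as the one in Figure 2 (rows as true grades, columns as predicted grades), the sketch below uses placeholder counts, not the paper’s results.

```python
# Per-class sensitivity (recall) and precision from a 4x4 confusion matrix.
from sklearn.metrics import confusion_matrix

labels = ["No AMD", "Early AMD", "Intermediate AMD", "Late AMD"]
y_true = [0, 0, 1, 1, 2, 2, 3, 3]   # example ground-truth grades
y_pred = [0, 0, 0, 1, 2, 3, 3, 3]   # example predicted grades

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
sensitivity = cm.diagonal() / cm.sum(axis=1)   # TP / (TP + FN), per class
precision = cm.diagonal() / cm.sum(axis=0)     # TP / (TP + FP), per class

for name, se, pr in zip(labels, sensitivity, precision):
    print(f"{name}: SE = {se:.2%}, PR = {pr:.2%}")
```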
Notably, the ensemble model achieved its highest overall performance when evaluated on the private dataset, reaching a QWK of 0.8046. This may be attributed to better image quality, more consistent labeling, or reduced class imbalance. The performance boost on this independent set supports the model’s generalizability, which is crucial for real-world deployment.
4. Discussion
In this work, we have developed an ensemble model for fundus image analysis to aid in AMD grading. Our approach combines 2 individual models of a very different nature: (1) a CNN with ResNetRS architecture based on SL, and (2) a network with ViT architecture based on SSL (RETFound). To the best of our knowledge, this is the first time that RETFound has been combined with a CNN model in an ensemble fashion for AMD grading.
The decision to implement a hybrid ensemble was based on empirical observations showing that standard pre-trained models, although effective individually, exhibited different error patterns. CNNs such as ResNetRS captured detailed drusen and pigment changes, whereas RETFound focused on broader retinal structures. By combining their outputs through a meta-learner, the ensemble effectively merged local and global feature representations, leading to improved consistency across datasets. This supports the theoretical rationale for hybrid modeling in retinal image analysis.
For our experiments, we used the largest dataset of fundus images publicly available in the context of AMD. Additionally, our approach was tested on an independent private set. This comprehensive evaluation provides insights into both the model’s internal validity and its external generalizability.
Results in Table 1 show that the ensemble model outperforms the individual models in almost every metric, achieving ACC = 66.03%, SE = 66.03%, PR = 68.20%, F1-score = 0.6510 and QWK = 0.7364. The incremental quantitative gains of the ensemble should be interpreted in the context of their qualitative impact. The hybrid model achieved more stable predictions across disease stages, showing fewer extreme misclassifications (e.g., Early vs. Late AMD) and improved consistency between datasets. As we hypothesized, the combination of ResNetRS and RETFound allowed us to reduce the prediction error and improve the model’s generalization. The results obtained using the private dataset were also consistent, with ACC = 70.95%, SE = 70.95%, PR = 66.71%, F1-score = 0.6651 and QWK = 0.8046. These results show an effective adaptation of the proposed method to independent datasets. It is worth noting that the ensemble does not merely average predictions but rather learns from the outputs of both models through a dedicated meta-learner, which enhances the decision process.
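A minimal sketch of this stacking idea is shown below, assuming each frozen base model outputs 4-class logits for a fundus image and a small trainable meta-learner combines their softmax probabilities; the layer sizes are illustrative assumptions, not the authors’ exact configuration.

```python
# Hedged sketch of a stacking ensemble: frozen ResNetRS and RETFound base
# models feed their class probabilities into a small trainable meta-learner.
import torch
import torch.nn as nn

class StackingEnsemble(nn.Module):
    def __init__(self, cnn, vit, num_classes=4):
        super().__init__()
        self.cnn, self.vit = cnn, vit
        for p in list(cnn.parameters()) + list(vit.parameters()):
            p.requires_grad = False               # base models stay frozen
        self.meta = nn.Sequential(                # only these weights are trained
            nn.Linear(2 * num_classes, 6),
            nn.ReLU(),
            nn.Linear(6, num_classes),
        )

    def forward(self, x):
        with torch.no_grad():
            p_cnn = torch.softmax(self.cnn(x), dim=1)   # CNN (ResNetRS) probabilities
            p_vit = torch.softmax(self.vit(x), dim=1)   # ViT (RETFound) probabilities
        return self.meta(torch.cat([p_cnn, p_vit], dim=1))
```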
When comparing the individual models with each other, the ResNetRS model shows higher performance than the RETFound model. Despite the promising results of the recent ViT-SSL approach, this work shows that a modern CNN architecture such as ResNetRS, combined with transfer learning, can achieve superior performance for AMD classification. Nonetheless, the RETFound model still contributes meaningful representations to the ensemble, especially in ambiguous cases where spatial attention becomes crucial.
The improvement observed in the private dataset is particularly promising. In real-world clinical deployments, models are often exposed to data distributions that differ significantly from those used in training. Thus, a method that generalizes well across centers, devices, and patient demographics is highly desirable. Our findings suggest that the ensemble model is capable of transferring knowledge effectively between domains.
Our results can be compared to those obtained in the literature for AMD grading, as shown in Table 3. However, the comparisons should be made with caution since the number of target classes and the specific datasets differ among studies. In the work of Burlina et al. [6], a high ACC = 83.20% was obtained, but QWK reached 0.6540, which is lower than our QWK = 0.7364. Like our study, they used the AREDS dataset and considered 4 AMD degrees. However, the provided results were computed by applying 5-fold cross-validation over a reduced subset of 67,401 images. Peng et al. also used the AREDS dataset [14]. Nonetheless, they selected 38,884 images for testing and classified AMD into 6 severity levels. Their ACC = 67.10% was similar to ours (ACC = 66.03%), while their QWK = 0.5580 was considerably lower than the one obtained with our approach (QWK = 0.7364). Ultimately, QWK is the most representative metric for multiclass classification [22]. Our higher QWK indicates that the predictions of our model more closely align with the ordinal structure of the ground truth labels, making the system more trustworthy in clinical applications. Although direct comparisons cannot be made, we can conclude, in general terms, that our ensemble model achieves higher performance than previous studies.
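For reference, QWK can be written as follows (a standard formulation, not reproduced from the cited works), where $O$ is the observed confusion matrix over $C$ ordered classes and $E$ is the matrix of agreement expected by chance (the outer product of the marginal distributions, scaled to the same total):

$$
\kappa_w = 1 - \frac{\sum_{i=1}^{C}\sum_{j=1}^{C} w_{ij}\, O_{ij}}{\sum_{i=1}^{C}\sum_{j=1}^{C} w_{ij}\, E_{ij}}, \qquad w_{ij} = \frac{(i-j)^2}{(C-1)^2}.
$$

Under this quadratic weighting with $C = 4$, predicting ‘Late AMD’ for a ‘No AMD’ eye is penalized nine times more than confusing two adjacent grades, which is why QWK reflects the ordinal nature of AMD severity better than plain accuracy.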
Although the proposed ensemble achieved an overall accuracy of 66.03% and a QWK of 0.7364, these results should be interpreted in the context of AMD screening rather than as a final diagnostic decision. In clinical terms, a QWK between 0.61 and 0.80 is considered indicative of substantial agreement with expert grading [24], suggesting that the model can reliably support ophthalmologists in large-scale screening programs. Most misclassifications occurred between “No AMD” and “Early AMD,” categories that are often difficult to distinguish even for human experts and have limited therapeutic implications. Conversely, the model showed strong performance for the detection of “Intermediate” and “Late AMD,” which are the stages requiring closer clinical follow-up or treatment. This behavior indicates that potential risks from misclassification are minimal from a clinical management standpoint.
Regarding computational complexity, the ensemble combines ResNetRS (≈200 M parameters) and RETFound (≈300 M parameters), for a total of nearly 500 M parameters. After both base models are trained and frozen, only the meta-learner remains trainable, consisting of 82 parameters. The ensemble was trained on an NVIDIA GeForce RTX 4080 GPU (batch size = 16) with an average training time of about 21 min per epoch. Once deployed, the model performs inference in under one second per image, confirming its efficiency for large-scale screening scenarios. These results show that the proposed hybrid ensemble achieves high accuracy without imposing substantial computational demands, supporting its practicality for research and potential clinical use.
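As a sanity check on the reported trainable size, the snippet below counts the trainable parameters of a hypothetical meta-learner with the layer sizes assumed in the earlier stacking sketch (8 inputs, a hidden layer of 6 units, 4 outputs), which happens to total 82; the actual layer layout is not detailed here.

```python
# Counting trainable parameters of the small meta-learner while the base
# models remain frozen (illustrative configuration only).
import torch.nn as nn

meta = nn.Sequential(
    nn.Linear(8, 6),   # 8 * 6 weights + 6 biases = 54 parameters
    nn.ReLU(),
    nn.Linear(6, 4),   # 6 * 4 weights + 4 biases = 28 parameters
)
print(sum(p.numel() for p in meta.parameters() if p.requires_grad))  # 82
```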
The present work has limitations that should be mentioned. First, the images from AREDS were collected between 1992 and 2004 using the cameras available at that time. Therefore, they show overall lower quality and resolution than images obtained with modern cameras. This is the reason why our validation procedure required a fine-tuning stage using part of the private dataset. In the future, it would be desirable to improve the robustness of the model for modern fundus images and avoid the need for adjustments. Moreover, the datasets used in this study, although representative, may not fully reflect the diversity of acquisition protocols, ethnic backgrounds, or disease manifestations found in real-world clinical settings. External validation on larger, multi-center datasets will be essential to confirm the generalizability of the proposed approach. Second, the model showed relatively low sensitivity for the detection of Early AMD, which represents an important clinical limitation since early identification is critical for effective prevention and monitoring. This reduced sensitivity is likely related to the subtle and often ambiguous visual signs of early disease, which are difficult to distinguish even for expert graders. Future work should focus on improving the detection of early-stage features by incorporating higher-resolution data, targeted augmentation, or attention mechanisms. Third, although the present study demonstrates the potential of the proposed hybrid ensemble, we recognize that a comprehensive ablation study comparing multiple ensemble strategies and recent hybrid architectures would provide further insight. Future research will address this aspect in detail, exploring how different combinations of CNN and transformer-based models, as well as alternative fusion strategies, affect performance and interpretability. Ablation studies would allow us to determine an optimized combination of the individual models. Fourth, the computational cost of the ensemble was only partially characterized. While preliminary figures were reported, a complete benchmarking of computational metrics would be valuable to evaluate the trade-offs between model complexity and accuracy. Finally, the present study lacks interpretability of its predictions. The use of explainable artificial intelligence (XAI) would help better comprehend the contribution of the individual models to the ensemble. In addition, an XAI analysis would increase confidence in the model in a clinical setting. Future work may integrate error analysis and techniques such as Grad-CAM or attention rollout to highlight the retinal regions that influence predictions, thus promoting clinical interpretability (a minimal illustrative sketch is given after this paragraph). Moreover, the current study did not include a prospective assessment of the model’s real-world impact in terms of clinical workflow or workload reduction. Such evaluations, combined with XAI visualizations, will be necessary to ensure that the system can be safely and effectively integrated into ophthalmic practice.
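The following is a minimal Grad-CAM sketch, provided as an illustration of the kind of XAI analysis mentioned above rather than the authors’ implementation; it assumes a PyTorch CNN and a reference to its last convolutional block, both of which are placeholders.

```python
# Hedged Grad-CAM sketch: highlights the retinal regions that most influence
# the predicted AMD grade for a single preprocessed fundus image.
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """image: tensor of shape (1, 3, H, W); target_layer: last conv block of the CNN."""
    activations, gradients = {}, {}

    h1 = target_layer.register_forward_hook(
        lambda m, i, o: activations.update(value=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: gradients.update(value=go[0]))

    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()   # explain the predicted grade
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)   # pooled gradients per channel
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # heatmap in [0, 1]
```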