Article

Twin Self-Supervised Learning Framework for Glaucoma Diagnosis Using Fundus Images

by
Suguna Gnanaprakasam
and
Rolant Gini John Barnabas
*
Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Coimbatore 641112, India
*
Author to whom correspondence should be addressed.
Appl. Syst. Innov. 2025, 8(4), 111; https://doi.org/10.3390/asi8040111
Submission received: 22 May 2025 / Revised: 1 August 2025 / Accepted: 4 August 2025 / Published: 11 August 2025

Abstract

Glaucoma is a serious eye condition that damages the optic nerve and impairs the transmission of visual information to the brain. It is the second leading cause of blindness worldwide. Deep learning-based CAD systems have shown promising results in diagnosing glaucoma, but they mostly rely on small labeled datasets. Annotated fundus image datasets improve deep learning predictions by aiding pattern identification but require extensive curation. In contrast, unlabeled fundus images are more accessible. The proposed method employs a semi-supervised learning approach to utilize both labeled and unlabeled data effectively. It follows traditional supervised training with the generation of pseudo-labels for unlabeled data, and incorporates self-supervised techniques that eliminate the need for manual annotation. Specifically, it uses a twin self-supervised learning approach in which the pseudo-labels produced by one self-supervised model are integrated into a second self-supervised model for detection. In the first stage, a self-supervised patch-based exemplar CNN generates the pseudo-labels. In the second stage, these pseudo-labeled data, combined with labeled data, train a convolutional auto-encoder classification model to identify glaucoma features. A support vector machine classifier performs the final glaucoma classification, achieving 98% accuracy and 0.98 AUC on the internal, same-source combined fundus image datasets. The model also generalizes reasonably well to external (fully unseen) data, achieving an AUC of 0.91 on the CRFO dataset and an AUC of 0.87 on the Papilla dataset. These results demonstrate the method’s effectiveness, robustness, and adaptability in addressing limited labeled fundus data, and can aid in improving health and lifestyle.

1. Introduction

Glaucoma is a disorder in which the optic nerve is damaged, leading to vision loss if not treated in time. It is often caused by increased pressure in the eye, called intraocular pressure [1]. Although the exact cause of glaucoma is not clearly known, factors such as aging, family history, and heredity are closely related to its development. Age is one of the most important factors: as one ages, both the likelihood of developing glaucoma and its rate of progression increase. Likewise, high blood pressure, diabetes, and nearsightedness can increase the risk of glaucoma. Regular screening and early diagnosis are crucial for improved treatment and care [2]. The major problem in this field is that glaucoma can occur and progress without obvious symptoms, so it often goes unnoticed until significant vision loss has occurred. Therefore, accurate and non-invasive diagnostic tools are essential for early detection of glaucoma. Fundus imaging is a simple, cost-effective, and commonly used retinal imaging modality for detecting glaucoma [3]. Changes in the appearance of the optic nerve are the first and most basic indication of glaucoma. Physicians must rely on regular eye exams and various diagnostic procedures to identify noticeable changes in the condition of the optic nerve. Other Optic Nerve Head (ONH) changes include loss of the Retinal Nerve Fiber Layer (RNFL) and the Neuro Retinal Rim (NRR). In addition to examining optic nerve changes in fundus images, other diagnostic methods are also used, such as tonometry, perimetry, gonioscopy, visual field testing, pachymetry, and Optical Coherence Tomography (OCT). These techniques complement each other and are combined as necessary for decision-making.
Limited time and resources for manual annotation and the reliance on expert experience are further issues in the identification and management of glaucoma [4]. Computer-Aided Diagnosis (CAD) technology assists medical professionals in diagnosing glaucoma accurately and efficiently. However, CAD systems that use handcrafted features of the ONH changes can produce inaccurate measurements because these structures are segmented imprecisely. The inaccurate information generated in this process has a cascading effect on the performance of diagnosis systems. This highlights the need to utilize deep learning methods in CAD systems for these applications.
Deep learning (DL) methods in CAD systems play an important role in achieving promising results [5,6]. By training on large datasets of fundus images, these algorithms can learn subtle patterns related to the condition, enabling them to identify even mild indicators of glaucoma and help with early diagnosis and treatment. This demands a diverse dataset that represents the different phases and presentations of glaucoma and the wide variation in symptoms and signs among patients. Labeled fundus images are essential for training deep learning algorithms, but collecting them from hospitals, clinics, or research organizations is difficult because they are available only in insufficient quantities [7].
On the other hand, unlabeled fundus images are plentiful in hospitals and easy to collect due to their volume and accessibility. However, labeling or annotating these large collections of unlabeled fundus images for supervised learning is hard because of their quantity and the time required from skilled professionals [8]. The lack of labels for most of these images makes it challenging to use supervised learning to recognize glaucoma, as identifying irregularities in the fundus images becomes more difficult. Unsupervised learning algorithms help with such unlabeled images by discovering hidden patterns, correlations, or structures that professionals without extensive training could not detect manually [9]. However, the quality and relevance of the unlabeled images may vary, and in some cases preprocessing is required before deep learning algorithms are applied.
A possible way to use unlabeled fundus images in supervised tasks is to generate pseudo-labels. Pseudo-labeling is a method in which a model is trained with labeled images and the trained model is then used to predict labels for unlabeled images, or to learn iteratively from its own predictions, which is called self-training. The predicted labels are known as pseudo-labels and are then treated as true labels for the unlabeled fundus images in disease diagnosis/detection. The main challenge in pseudo-label generation is the consistency and correctness of the predictions [10]. This depends on the performance of the initial model, which may not be completely accurate and may struggle with uncertain cases. To tackle this issue, various techniques have been suggested, such as setting a threshold on predicted probabilities, confidence prediction [11,12], combining pseudo-labeling with active learning [13,14,15], and combining fine-tuning with semi-supervised learning or self-supervised learning [16].
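As an illustration of the thresholding idea mentioned above, the following minimal sketch keeps only those pseudo-labels whose top predicted probability reaches a cut-off. The function name `select_pseudo_labels`, the threshold of 0.9, and the example probabilities are assumptions made for illustration; they are not taken from the cited works.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    probabilities: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical cut-off, not a value from the paper
) -> List[Tuple[int, int, float]]:
    """Return (sample_index, predicted_class, confidence) for confident predictions.

    probabilities[i] holds the class probabilities that an already trained
    model assigns to unlabeled sample i. Samples whose highest probability
    is below `threshold` are skipped and stay unlabeled.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        confidence = max(probs)
        predicted_class = list(probs).index(confidence)
        if confidence >= threshold:
            selected.append((index, predicted_class, confidence))
    return selected


# Made-up model outputs for four unlabeled images: [P(normal), P(glaucoma)]
example_probs = [[0.97, 0.03], [0.55, 0.45], [0.08, 0.92], [0.40, 0.60]]
confident = select_pseudo_labels(example_probs, threshold=0.9)
```

In this toy case only the first and third images receive pseudo-labels; the two uncertain images would remain unlabeled or be handed to another strategy such as active learning.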
Self-supervised learning and semi-supervised learning methods have garnered interest and yielded significant results in various fields, including clinical applications. Self-supervised learning allows artificial intelligence (AI) models to pre-train on large-scale unlabeled images with auxiliary labels in a supervised manner, significantly boosting performance on downstream tasks of the model and mitigating the problem of the unavailability of labeled images [17], which hinders the progress in glaucoma detection. On the other hand, semi-supervised learning integrates both labeled and unlabeled images to create a more robust and representative training set. It starts by using unlabeled images to identify patterns in an unsupervised manner for generative tasks, a process also referred to as self-supervised learning, and then further trains on labeled images for classification tasks [18].
Pseudo-labeling [19] is another useful strategy that can be used as part of, or in combination with, self-supervised or semi-supervised learning to leverage unlabeled images and to uncover and comprehend their inherent structure or information. In this approach, the model makes predictions on the unlabeled data to create pseudo-labels, which are then fused with the original labeled images for retraining. This process allows the model to benefit from the extra insights found in the unlabeled data and improves its performance [16,20]. A dedicated focus on pseudo-labeling may open opportunities for further research and advancements in methods that effectively leverage unlabeled images in various applications, especially in scenarios like glaucoma detection where obtaining labeled data is challenging. Traditional supervised learning relies solely on labeled data for model training, necessitating a substantial amount of data that can be time-consuming and expensive to generate, whereas in many real-world circumstances vast amounts of unlabeled images are readily available. Finding an effective way to utilize the unlabeled data may be an efficient way of addressing this problem.

The Contribution of This Work

This section outlines the requirements of an effective method for glaucoma detection and highlights the importance of exploiting the abundantly available unlabeled data to solve the problem at hand. The proposed work employs a twin self-supervised strategy: (i) exemplar Convolutional Neural Network (CNN) discriminative feature learning to construct pseudo-labels, and (ii) semi-supervised learning using a Convolutional Auto-Encoder (CAE) to overcome the limited availability of labeled images in detecting glaucoma. The contributions of the proposed work are the following:
  • Addressing the challenge of limited annotated fundus images: the work creates a twin model (semi-supervised and self-supervised) with a CAE and exemplar CNN method for glaucoma detection that uses both labeled and unlabeled fundus images and addresses the barrier of limited labeled fundus images.
  • Generating pseudo-labels for unlabeled fundus images: to improve the detection of glaucoma using unlabeled images, the proposed method uses self-supervised learning (exemplar CNN) to generate pseudo-labels; the model learns the beneficial features and the inherent structure of the images by itself, without the need for explicit labels or supervision.
  • Exemplar CNN method trained for effective and enhanced feature learning: differently transformed patches are curated to form synthetic classes, and the CNN is trained to differentiate between these synthetic classes, allowing it to acquire discriminative characteristics from the unlabeled images and leading to improved performance in glaucoma detection.
  • Validation of the effective use of unlabeled fundus images: the proposed model can effectively employ unlabeled images for training and further improves its performance by producing pseudo-labels. The improvement in performance and learning is verified through increased accuracy, sensitivity, and other metrics in glaucoma detection using unlabeled fundus images.
  • Adaptability and robustness to changes in dataset size: performance analysis was carried out by varying the dataset size over small to large datasets to test the proposed model’s adaptability and requirements.
  • Adaptability and robustness over varied datasets: the improvement in performance when the proposed method is applied to fundus images from a variety of datasets shows its adaptability and robustness.
Overall, this work’s contributions address the challenge of limited labeled images, effectively use unlabeled images through semi-supervised and exemplar CNN methods, and enhance the performance of the glaucoma diagnosis model via two-way self-supervised learning methods. Furthermore, by purposefully ignoring the labels of public datasets, the work recreates a scenario in which a large volume of unlabeled data is available (as is frequently the case in real-world settings), while still allowing the correctness of the model’s predictions to be checked. This makes it possible to evaluate how well unsupervised or semi-supervised learning techniques use large amounts of unlabeled data to improve the performance of CAD systems.
The rest of this paper is organized as follows: Section 2 provides an overview of the related works in diagnosing glaucoma. Section 3 describes the datasets used, and Section 4 explains the proposed twin self-supervised learning method and its training process. Section 5 presents the experimental findings and discusses the outcomes by drawing observations and comparing the performance of various systems. Finally, Section 6 concludes the work and discusses future possibilities.

2. Related Works

This section reviews the progress made to date in glaucoma detection, both by studies that use only small labeled fundus image datasets and by studies that utilize the widely available unlabeled images. Section 2.1 covers research based on supervised learning methods with small labeled datasets. Section 2.2 discusses work based on semi-supervised learning methods for glaucoma detection using unlabeled fundus images. Section 2.3 outlines research based on pseudo-label generation methods.

2.1. Supervised Learning

A few researchers have addressed the problem of limited labeled data in glaucoma diagnosis by utilizing the knowledge gained from pre-trained models through transfer learning. Transfer learning entails freezing most, if not all, of the layers of the pre-trained model and fine-tuning at least the last few layers for the specific problem [21]. Table 1 summarizes supervised learning for glaucoma detection and compares the different datasets and models employed in transfer learning. To investigate the impact of dataset size on pre-trained CNN model training, the work in [22] suggested a pre-trained model approach using DENet, VGG19, GoogLeNet, and ResNet-50. The work also covers several experiments (hyper-parameter selection, comparison of different CNN algorithms, 10-fold cross-validation, comparison of the performance of expert and non-expert evaluators, the influence of the dataset on performance, and integration of medical data with the CNN) on the RIM-ONE, Drishti-GS, and ESPERANZA datasets to quantitatively analyze the contribution of various elements of the model, including the choice of architecture, training plans, and training dataset. Among the models, VGG19 performs best, yielding an area under the curve (AUC) of 0.94 and a balanced accuracy of 88.05%. The work in [23] investigated the viability of utilizing pre-trained CNNs for glaucoma detection, especially with a very small number of images. The study used the high-resolution fundus (HRF) database, a widely used benchmark dataset in the field. The work shows that, among the pre-trained CNN architectures explored, VGG16 is the most suitable for glaucoma identification.
The work in [24] used a comprehensive collection of 1563 fundus images to classify healthy and glaucoma cases. Eight different deep neural networks, namely VGG16, Inception-v3, EfficientNetB7, ResNet-152, DenseNet-201, NASNetMobile, ResNet-50, and ResNet-101V2, were implemented for glaucoma detection. The study focused on precision and recall, and the Inception-v3 model appeared to perform best for glaucoma image classification. In [25], the authors proposed a transfer learning model called EyeNet, which was originally trained to detect diabetic retinopathy. The model was fine-tuned using the RIM-ONE fundus dataset for glaucoma detection and achieved 87% accuracy. The work in [26] explored a variety of deep learning algorithms using multiple datasets. The models were trained on the ACRIMA, ORIGA, and HRF datasets and evaluated on a private dataset and the Drishti-GS dataset, using transfer learning with pre-trained models. The work in [27] employed data augmentation to enhance fundus images from the Drishti-GS, ORIGA, RIM-ONE, and G1020 datasets for better detection. It utilized transfer learning with the ResNet-50 model and achieved high performance on the G1020 dataset: 98.48% accuracy, 99.30% sensitivity, 96.52% specificity, 0.97 AUC, and 98% F1 score.
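Since the studies in this subsection are compared through accuracy, sensitivity, specificity, balanced accuracy, and F1 score, a short sketch of how these quantities follow from a binary confusion matrix may help. The counts in the example are made up for illustration and do not come from any of the cited works; glaucoma is assumed to be the positive class.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute standard binary classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. there is at least one
    positive sample, one negative sample, and one positive prediction.
    """
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }


# Hypothetical counts: 50 glaucoma images and 50 normal images
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
```

Balanced accuracy averages sensitivity and specificity, which is why it is often reported alongside plain accuracy when the classes are imbalanced, as they are in most fundus datasets.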

2.2. Semi-Supervised Learning

In semi-supervised learning strategies, some work utilizes only unlabeled data for unsupervised learning to extract useful features from images. Unsupervised feature extraction helps models learn from unlabeled data and then fine-tune for supervised classification. In certain cases, synthetic data were created using models like Auto-Encoder (AE) and Generative Adversarial Network (GAN), leveraging unlabeled data in an unsupervised learning process. This approach artificially enhances the training dataset’s diversity. Additionally, artificial data augmentation techniques, including image modifications and perturbations, are applied to unlabeled data. Synthetic data enhances the labeled dataset by providing more training samples, leading to better generalization and classification, especially with small original datasets. Some of these kinds of work are discussed in this section.
Table 2 provides an overview of semi-supervised learning for glaucoma detection and compares the various datasets and models used. Bechar et al. [28] proposed a super-pixel-based semi-supervised segmentation algorithm that employs a robust co-forest classifier trained using a limited number of annotated super-pixels from the optic cup and disc regions alongside multiple unlabeled super-pixels from the same areas. When compared against expert opinion, the semi-supervised technique proposed in that work achieved an accuracy of 90.8% for glaucoma identification.
Pal et al. [29] proposed a multi-model network called G-EyeNet for detecting glaucoma from retinal fundus images. A deep CAE is used in an unsupervised manner to learn the hidden features of fundus images, and a softmax classifier is used for glaucoma classification in this network. Based on a multi-task learning approach, the multi-model network is jointly tuned to minimize image reconstruction and classification errors. The encoder framework’s unsupervised training aids in learning a suitable input distribution, which aids in categorization, and it results in an AUC of 0.92 on the DRIONS-DB dataset.
Raghavendra et al. [30] showed the effective use of cascaded sparse AE to detect glaucoma. The two sparse AEs are initially trained separately for unsupervised learning to learn the characteristics of fundus images. One softmax classifier is also prepared for glaucoma detection with supervised learning. Then, these three components are cascaded. The features extracted from the first AE are fed as input to the second AE, and the features extracted by the second AE are fed into the classifier at the third stage, where glaucoma classification with labeled data was performed using a softmax layer, yielding an accuracy of 95%.
Diaz-Pinto et al. [31] proposed two different systems. The first system, called the Deep Convolutional Generative Adversarial Network (DCGAN), serves solely as an image synthesizer, generating synthetic fundus images based on an adversarial strategy. The second system, known as the semi-supervised DCGAN (SS-DCGAN), functions both as a semi-supervised image synthesizer and as a classifier for glaucoma. The training of the SS-DCGAN utilized 1650 labeled fundus images and 84,569 unlabeled images. The resulting discriminator/classifier from the SS-DCGAN was employed for glaucoma classification, achieving an AUC of 0.90.

2.3. Pseudo-Label Generation

The pseudo-label generation approach entails learning intrinsic data representations to predict labels for unlabeled samples. This method, which mixes unsupervised and semi-supervised learning, is an effective strategy for dealing with data shortages and enhancing model performance. Very few works have focused on generating pseudo-labels and using them for glaucoma detection, so there is still room for further exploration of this approach, which could substantially improve detection performance.
A summary of pseudo-label-generation-based research for glaucoma detection is given in Table 3, which also compares the different datasets and models employed. Al Ghamdi et al. [12] proposed a two-stage network with both supervised and unsupervised methods. In the first stage, pre-trained models are trained with labeled data for glaucoma detection. In the second stage, self-learning is applied to unlabeled data: the trained model generates pseudo-labels for the unlabeled data, and the unlabeled samples whose pseudo-labels have high scores are combined with the labeled data for subsequent training. Retinal fundus images from the RIM-ONE r2 dataset are used as labeled data, and the RIGA dataset is used as unlabeled data. This semi-supervised model achieved an accuracy of 92.4% on the RIM-ONE r2 dataset.
Alghamdi and Abdel-Mottaleb [32] examine three glaucoma detection models: Transfer Convolutional Neural Network (TCNN), Semi-Supervised Convolutional Neural Network with self-learning (SSCNN), and SSCNN with AE. The VGG16 model is utilized in the TCNN framework with transfer learning for glaucoma detection. The SSCNN is a two-stage model. In the first stage, it uses the same model as TCNN. In the second stage, the training dataset is expanded with unlabeled fundus images and pseudo-labels predicted through a self-learning method from TCNN. In SSCNN-DAE, a Denoising Auto-Encoder (DAE) with a convolution network is used and trained in an unsupervised manner to learn the features of fundus images. The encoder of this, combined with the softmax layer, is trained for glaucoma classification using labeled fundus images. SSCNN-DAE outperformed the other two models tested in this work in terms of accuracy, demonstrating the usefulness of the combined feature extraction and supervised classification strategy in a semi-supervised scenario and achieving the highest accuracy of 93.8%. Fan et al. [33] introduced a multi-task Siamese network for semi-supervised learning, incorporating a self-training approach to augment training with pseudo-labels. This method achieved an accuracy of 90.2%.

2.4. Motivations from the Related Work

This section summarizes the observations made from the survey of glaucoma diagnosis using three learning strategies, namely supervised learning, semi-supervised learning, and pseudo-label generation. Enhancing pre-trained algorithms with large unlabeled fundus datasets can lead to advances in glaucoma diagnosis. This semi-supervised learning approach can enhance a model’s decision-making capability, generalization, and adaptability by learning richer, more diverse representations from unlabeled fundus data. Once enhanced, the pre-trained models can be used to generate pseudo-labels for unlabeled fundus data in the target domain (e.g., glaucoma detection). Furthermore, using unlabeled data permits building a model with custom hospital data, reducing the reliance on publicly available labeled data (in which the labeling was found to be inaccurate when the annotations were analyzed) [34]. Additionally, the research community is beginning to acknowledge unlabeled data as a useful tool for discovering advantageous patterns in fundus images, particularly for feature extraction in glaucoma detection. Techniques such as representation learning using AEs are used to extract information from unlabeled fundus datasets. This enhances model capabilities in classification within a semi-supervised framework when labeled data is limited. However, the discussed works (Table 3) offer little insight into strategies for generating pseudo-labels from unlabeled fundus images and using them in conjunction with labeled data to develop models.
The proposed work distinguishes itself by merging semi-supervised learning and pseudo-label generation in a new manner, cascading the two self-supervised learning approaches, one that generates the pseudo-labels for unlabeled fundus images, and another one that utilizes these pseudo-labeled fundus images, along with labeled fundus images, to train a classifier (part of the self-supervised learning framework) for glaucoma detection. This approach aims to improve diagnostic performance for glaucoma detection by improving the model’s ability to learn meaningful representations from unlabeled fundus images and also from the expanded training dataset.

3. Materials

Figure 1 shows sample fundus images of normal (Figure 1a,c) and glaucoma (Figure 1b,d) cases. In the glaucoma cases, the shape and color of the optic nerve head show damage or cupping, seen as an enlarged cup–disc ratio. Panels (a) and (b) are from Drishti-GS [35], and panels (c) and (d) are from ORIGA [36].
Table 4 gives the details of the fundus dataset considered in the proposed work; a portion of this dataset serves as the smaller dataset for pseudo-label generation and is denoted $S_1$ throughout this work. Table 5 gives the details of the fundus datasets used for pseudo-label generation in the larger-dataset case, denoted $L_1$ throughout this work. The proposed method examines not only the effectiveness of the pseudo-label generation technique in detecting glaucoma but also the effect of changing the size of the dataset used to generate the pseudo-labels.
The dataset $D_{fundus}$ is created by combining two sources: Drishti-GS [35], which includes 101 images (70 glaucoma and 31 normal), and ORIGA [36], which includes 650 images (168 glaucoma and 482 normal). This results in a unified dataset of 751 images, offering a broader and potentially more diverse set of images for evaluating the model’s performance when a small amount of data is available. The merged dataset ($D_{fundus}$) contains 238 glaucoma images and 513 normal images. Portions of this merged dataset are used as the small unlabeled fundus dataset $S_1$ for pseudo-label generation in this work. A larger merged fundus dataset comprising 8169 images [37,38,39,40,41,42,43], denoted $L_1$ in this work and shown in Table 5, is used in its entirety for pseudo-label generation and serves as the large dataset. Both the small and the larger merged fundus datasets are preprocessed by resizing and standard-scaler normalization so that images are standardized across the datasets.
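The merged dataset counts quoted above can be checked with a few lines of arithmetic. The sketch below uses only the per-class counts stated in the text; the `DatasetCounts` helper is a hypothetical name introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCounts:
    """Per-class image counts for one fundus dataset (illustrative helper)."""
    name: str
    glaucoma: int
    normal: int

    @property
    def total(self) -> int:
        return self.glaucoma + self.normal


# Counts as stated in the text
drishti_gs = DatasetCounts("Drishti-GS", glaucoma=70, normal=31)
origa = DatasetCounts("ORIGA", glaucoma=168, normal=482)

# Merge the two sources into the combined dataset
merged = DatasetCounts(
    "D_fundus",
    glaucoma=drishti_gs.glaucoma + origa.glaucoma,
    normal=drishti_gs.normal + origa.normal,
)

# Share of glaucoma images in the merged set (roughly one third)
glaucoma_fraction = merged.glaucoma / merged.total
```

The merged set is noticeably imbalanced toward normal cases, which is one reason class-balanced (stratified) subsets matter when the data is later split.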

4. Proposed Method

This section details the proposed method. As illustrated in Figure 2, the proposed work can be divided into two self-supervision methods. Self-supervision 1 creates pseudo-labels through feature learning via an exemplar CNN approach. Self-supervision 2 trains an unsupervised CAE and then uses its trained encoder to extract features from the fused fundus images (the pseudo-labeled images created in self-supervision 1 together with the labeled fundus images). These features are then used to train the Support Vector Machine (SVM) classifier for glaucoma detection.
The acquired dataset $D_{fundus}$ is split into 80% unlabeled data $D_{unlabeled}$ and 20% labeled data $D_{labeled}$ (or $D_L$), which is used for all training and testing for glaucoma detection in the proposed model. The training and testing sets remain constant throughout all glaucoma detection experiments in the proposed method. The unlabeled data $D_{unlabeled}$ is then split into two portions of 40% each, $D_{u1}$ and $D_{u2}$, and a small portion of $D_{u1}$, denoted $D_{u3}$, is used for synthetic label creation. All separation was performed manually to ensure that the training and testing sets are clearly disjoint and to prevent data leakage. A complete listing of image identifiers for each subset is presented in the Supplementary Material (Tables S1–S5). Manual selection also ensured a balanced class distribution in all subsets. After $D_{fundus}$ is separated into $D_{u1}$, $D_{u2}$, and $D_{u3}$ via manual stratified sampling, the synthetically labeled data ($D_{u3}$) is used to optimize the pre-trained CNN model, which is then fine-tuned on the labeled data ($D_L$) for glaucoma detection. The tuned model generates pseudo-labels for the unlabeled data $D_{u2}$ (or $S_1$). This data separation is described in Algorithm 1 and detailed in the Supplementary Material; the implementation of pseudo-label generation is shown in Figure 2 and described in Steps 1 and 2 of Algorithm 2.
The unlabeled data $D_{u1}$ is used only for unsupervised CAE training, in which the fundus image is reconstructed by minimizing the Mean Square Error (MSE) [44]. After that, the CAE encoder is frozen, as depicted in Figure 2 and described in Algorithm 3. The frozen encoder extracts features from the expanded set of fundus images (the labeled data $D_L$ and the pseudo-labeled data $D_{PL}$). These features are then input into an SVM classifier to differentiate between normal and glaucoma instances, as depicted in Figure 2 and described in Algorithm 3. The advantage of the SVM classifier is that it employs a kernel approach to effectively classify even non-linearly separable instances [45]. Additionally, compared with neural networks, SVMs generalize better with small samples, making them more suitable for our limited dataset [46].
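For reference, the reconstruction objective described above can be written out explicitly. The notation below is an assumption introduced for illustration and is not taken from the paper: $f$ denotes the CAE encoder, $g$ the decoder, $x_i$ the $i$-th of $N$ unlabeled training images, and $P$ the number of pixel values per image.

```latex
\mathcal{L}_{\mathrm{MSE}}
  = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{P}
    \left\lVert x_i - g\bigl(f(x_i)\bigr) \right\rVert_2^2
```

Under this notation, freezing the encoder after training means that $f$ is no longer updated, and each labeled or pseudo-labeled image $x_j$ is represented by the feature vector $z_j = f(x_j)$ that is passed to the SVM classifier.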
Algorithm 1: Data Set Separation
Input: Fundus dataset ($D_{fundus}$)
Output: $D_L$: labeled data; $D_{u1}$, $D_{u2}$, $D_{u3}$: unlabeled data
  • Step 1: Dataset separation (using manual stratified sampling)
  •     $D_{fundus}$ → Labeled fundus images
  •     80% $D_{fundus}$ → $D_{unlabeled}$
  •     20% $D_{fundus}$ → $D_{labeled}$ or $D_L$; // Supervised learning (training set and testing set for glaucoma detection)
  •     40% $D_{unlabeled}$ → $D_{u1}$; // Unsupervised learning (training set and testing set for reconstruction of images)
  •     40% $D_{unlabeled}$ → $D_{u2}$ or $S_1$; // Pseudo-label generation
  •     Small portion of $D_{u1}$ → $D_{u3}$; // Synthetic label generation (training and testing set for geometric transformation classification)
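The separation in Algorithm 1 can be approximated programmatically. The sketch below is not the authors' procedure, which was performed manually; it swaps in a seeded random stratified split. It assumes that the two 40% portions are equal halves of the unlabeled pool, and the size of the small subset (`u3_frac`) is a made-up parameter, since the text only says "small portion". The image identifiers in the example are placeholders.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    items: Sequence[Tuple[str, Hashable]],
    labeled_frac: float = 0.2,
    u3_frac: float = 0.1,  # assumption: the source does not give this size
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Split (image_id, class_label) pairs into D_L, D_u1, D_u2, and D_u3.

    Each class is split separately so that every subset keeps roughly the
    class proportions of the full dataset. D_u3 is a subset of D_u1.
    """
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[str]] = defaultdict(list)
    for image_id, label in items:
        by_class[label].append(image_id)

    splits: Dict[str, List[str]] = {"D_L": [], "D_u1": [], "D_u2": [], "D_u3": []}
    for label in sorted(by_class, key=str):
        ids = list(by_class[label])
        rng.shuffle(ids)
        n_labeled = round(labeled_frac * len(ids))
        labeled, unlabeled = ids[:n_labeled], ids[n_labeled:]
        half = len(unlabeled) // 2
        d_u1, d_u2 = unlabeled[:half], unlabeled[half:]
        n_u3 = max(1, int(u3_frac * len(d_u1)))
        splits["D_L"].extend(labeled)
        splits["D_u1"].extend(d_u1)
        splits["D_u2"].extend(d_u2)
        splits["D_u3"].extend(d_u1[:n_u3])
    return splits


# Placeholder identifiers matching the merged class counts (238 glaucoma, 513 normal)
items = [(f"g{i}", 1) for i in range(238)] + [(f"n{i}", 0) for i in range(513)]
splits = stratified_split(items)
```

Splitting per class before concatenating is what keeps the labeled, reconstruction, and pseudo-labeling subsets disjoint while preserving the class balance that the manual selection was meant to ensure.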
The proposed methodology consists of several modules, which are detailed in the following sections: pseudo-label generation module with an exemplar CNN discriminative learning, AE module in unsupervised and supervised learning, and cascaded module where expanded datasets are used to train an SVM classifier with a trained AE for glaucoma detection. Section 4.1 discusses the self-supervised learning approach. Section 4.2 explains how pseudo-labels are generated by self-supervised learning with the pre-trained model, utilizing an exemplar CNN feature learning method. Finally, Section 4.3 covers the detection of glaucoma using a self-supervised CAE model trained on the expanded fundus dataset.

4.1. Self-Supervised Learning

At present, self-supervised learning contributes significantly to unsupervised learning in medical applications, largely because unlabeled data is plentiful. In a self-supervised model, the learning process typically consists of two stages, pretext task training and downstream task training, as shown in Figure 3, which illustrates the basic self-supervised learning framework. During pretext task training, the model learns to extract features via self-supervision from a large amount of unlabeled data, creating artificial labels for the pretext task from the intrinsic structure of that data. Subsequently, the model trained on the pretext task is transferred to the downstream task of glaucoma detection, where labeled data is limited. The proposed method employs two self-supervised learning approaches to address the limited number of labeled fundus images in glaucoma detection. In self-supervision 1, the pretext task is an exemplar CNN process, and the downstream task is glaucoma detection. In self-supervision 2, the pretext task is the reconstruction of fundus images using an AE, and the downstream task is again glaucoma detection. Through this self-supervised learning strategy, the models can extract the features of fundus images from the vast pool of unlabeled fundus images, shown as self-supervised learning 1 and 2 in Figure 2.

4.2. Pseudo-Label Generation in the Proposed Method

Figure 4 demonstrates the process of generating pseudo-labels for unlabeled fundus images for glaucoma detection, and the process is detailed in Algorithm 2. The pseudo-labels $y_{pseudolabel}$ for the unlabeled fundus images of the $D_{u2}$ dataset are generated using a pre-trained ImageNet model that has been trained through self-supervised process 1; this is illustrated in Figure 4 and corresponds to Stage 2 in Figure 2. In this work, the exemplar CNN uses the ImageNet pre-trained model as its backbone for feature extraction in the pretext task of self-supervised learning. Rather than directly fine-tuning the ImageNet model for transfer learning, the model is first trained to classify synthetic labels for enhanced feature understanding, and its softmax classifier is then fine-tuned for glaucoma detection, which is Stage 1 in Figure 2. Various pre-trained models were employed to predict the pseudo-labels of the fundus images for comparison. The process of self-supervised learning 1 is explained in this section.

4.2.1. Self-Supervision 1

The self-supervision process illustrated in Figure 4 shows the pretext and downstream tasks of self-supervised learning, as detailed in Steps 1 and 2 of Algorithm 2. In the exemplar CNN, the pretext task is to classify patches according to the transformations they have undergone, and the downstream task is to detect glaucoma from fundus images. This represents the discriminative feature learning and the transfer learning process of Stage 1 in Figure 2 of the proposed method.
Pretext Task—Exemplar CNN Feature Learning
In the proposed method, an exemplar CNN discriminative feature learning approach [47] is employed as the pretext task for pseudo-label generation. It uses the pre-trained CNN model, as shown in Figure 4, together with extensive, controlled data augmentation and synthetic classes to perform unsupervised feature learning for glaucoma detection. This strategy aims to enhance the model’s performance by enabling it to learn from a larger pool of unlabeled data and diverse augmented patches. This kind of model is called an exemplar CNN model.
Step 1 of Algorithm 2 describes the pretext task of the discriminative feature learning approach. This section summarizes the workflow of this methodology, as shown in Figure 4. $D_{u3}$ comprises a collection of $q$ unlabeled fundus images denoted as $X_i^u$, where $i = 1, 2, \ldots, q$. Step 1.1 of Algorithm 2 explains the patch division process for discriminative feature learning. Initially, patches ($P$) of size 128 × 128 are extracted from the unlabeled fundus images ($X_i^u$), creating a grid of 10 × 10 patches per image for a total of $P_T$ patches. These patches, shown in Figure 4, include the optic nerve head region or parts of optic nerve head regions of fundus images with significant gradients. A data augmentation strategy is implemented to generate diverse, augmented versions of these original patches. This augmentation process expands the training dataset significantly without requiring manual annotation by skilled persons or external labels. Step 1.2 in Algorithm 2 explains the augmentation process, in which different sets of transformations ($N_t$) are applied to each patch, resulting in a comprehensive collection of patches. These transformations include random variations such as rotation, shear, zoom, horizontal flip, brightness change, width shift, and height shift, enabling the exemplar CNN to learn and capture the diverse patterns and variations present in the fundus images, as illustrated in Figure 4.
To augment each patch, a possible parameter ($L$) for each transformation is chosen from the random parameter vectors ($a_j$), such as a rotation of 40°, a shear range of 0.2, a zoom range of 0.2, a brightness range of (0.5, 1.5), a width shift of 0.5, and a height shift of 0.5. The transformation is applied to each fundus patch to produce its augmented version. The augmented fundus patches ($SP_j$) allow the exemplar CNN to generalize the features effectively and learn robust feature representations, which are beneficial for unsupervised learning tasks like glaucoma detection with unlabeled fundus images. Each set of transformations is assigned a synthetic class label by the pretext task, as shown in Figure 4. By leveraging these synthetic classes created through data augmentation, the exemplar CNN can extract crucial features and generalized patterns from a diverse dataset, which improves performance on unseen variations.
In the exemplar CNN, the pre-trained ImageNet models VGG16, ResNet-50, Inception-v2, Xception-v3, and MobileNet are now trained using these augmented fundus image patches. The objective is to classify image patches into various synthetic classes while minimizing the Cross-Entropy (CE) loss function. The model’s weights are tuned using the Stochastic Gradient Descent (SGD) optimizer. The optimal weights are extracted and stored based on the highest validation accuracy achieved. The utilization of data augmentation and synthetic classes in the exemplar CNN methodology offers a powerful approach to unsupervised feature learning. This technique enables the model to learn from unlabeled data effectively by leveraging the diversity and variability introduced through augmented patches. Ultimately, this learning involves understanding the variations in characteristics and changes in different features found in the augmented patches, which enhances the model’s capability to detect glaucoma accurately. Further, the trained model from the pretext task has been used for the downstream task of glaucoma detection and is described in the downstream task.
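The synthetic-class construction described above (one class label per set of transformation parameters) can be illustrated with a short Python sketch. It does not transform any pixels; it only shows how parameter vectors might be sampled and paired with patches. The names `PARAM_RANGES`, `sample_parameter_vectors`, and `build_pretext_samples` are hypothetical, and reading each quoted value as a symmetric range around "no change" is an assumption of this sketch rather than something the text specifies.

```python
import random

# Ranges based on the values quoted in the text. Interpreting them as
# symmetric intervals (e.g. rotation in [-40, 40] degrees) is an assumption.
PARAM_RANGES = {
    "rotation_deg": (-40.0, 40.0),
    "shear": (-0.2, 0.2),
    "zoom": (0.8, 1.2),
    "brightness": (0.5, 1.5),
    "width_shift": (-0.5, 0.5),
    "height_shift": (-0.5, 0.5),
}


def sample_parameter_vectors(n_t, seed=0):
    """Draw n_t random parameter vectors, one per synthetic class."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n_t):
        vector = {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}
        vector["horizontal_flip"] = rng.random() < 0.5
        vectors.append(vector)
    return vectors


def build_pretext_samples(num_patches, vectors):
    """Pair every patch index with every parameter vector.

    The synthetic class label is j = 1, ..., N_t, the index of the
    transformation set, following Step 1.3 of Algorithm 2.
    """
    return [
        (patch_idx, j, vector)
        for patch_idx in range(num_patches)
        for j, vector in enumerate(vectors, start=1)
    ]


if __name__ == "__main__":
    vectors = sample_parameter_vectors(n_t=8)
    samples = build_pretext_samples(num_patches=100, vectors=vectors)
    print(len(samples), "pretext samples,", len(vectors), "synthetic classes")
```

In a full pipeline, each `(patch_idx, j, vector)` triple would be turned into an augmented patch by an image library, and the label `j` would be the target of the cross-entropy loss.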
Algorithm 2: Pseudo-label generation
Input: Unlabeled fundus images $D_{u2}$. Output: Pseudo-labels $y_{pseudolabel}$
  •  Step 1: Pretext task (discriminative feature learning)
  •     Input: Unlabeled fundus images $X_i^u \in \{X_1^u, X_2^u, \ldots, X_q^u\} = D_{u3}$, where $q$ is the total number of images
  •       Step 1.1 Patch division:
  •            For each image $X_i^u$ in $D_{u3}$,
  •               randomly sample into patches
  •                  $m$ → number of patches ($P$) of size 128 × 128 per image
  •                  $P \in \{P_1, P_2, \ldots, P_m\}$
  •             $P_T$ → $m \times q$; total number of patches
  •       Step 1.2 Transformation ($T_r$):
  •            Transformation and parameters:
  •                $\{T_r \mid r \in N_t\}$, where $N_t$ is a set of transformations
  •                $a_j^1, a_j^2, \ldots, a_j^K$: random parameter vectors, where $j = 1, 2, \ldots, N_t$
  •                {La}, where L is a possible parameter vector in each transformation in $N_t$
  •                $N_t \in$ {rotation → 40°, shear range → 0.2, zoom range → 0.2, horizontal flip, brightness range → (0.5, 1.5), width shift → 0.5, height shift → 0.5}
  •            For each patch in $P_T$
  •                 $P_i \xrightarrow{T_r} TP_i$, where $TP_i$ is the transformed patch
  •            Set of transformed patches: $SP = \{TP_i : P_i \in P_T\}$
  •       Step 1.3 Class assignment
  •           Assign label $j$ to each transformed patch $SP_j$, where $j = 1, 2, \ldots, N_t$
  •       Step 1.4 Training of ImageNet pre-trained model
  •           Load the model
  •           Compile the model with optimizer SGD, loss function CE, classification metrics (val_acc, loss)
  •           Train the model with $SP_j$ and label $j$
  •              for $e = 1, \ldots,$ num_epoch
  •                  update ImageNet model weights $\theta$ using SGD
  •                     $\theta_e = \theta_{e-1} - \eta \nabla_\theta L(\theta_{e-1})$, where $\eta$ is the learning rate
  •              end for
  •             Extract the optimal weights $\theta^*$
  •                     $\theta^* = \theta_{e^*}$, where $e^* = \arg\max_e \mathrm{accuracy}_{val}^{(e)}$
  •  Step 2: Downstream task (glaucoma detection)
  •       Input: $X_1^l, X_2^l, \ldots, X_n^l$ and $y_1^l, y_2^l, \ldots, y_n^l$; $X_i^l, y_i^l \in D_L$,
  •                    where $X_i^l$ is a labeled fundus image, $y_i^l$ is the corresponding label, and $n$ is the total number of images
  •               Training data $D_{tr}$ and testing data $D_{te}$; $D_{tr}, D_{te} \subset D_L$
  •               $X_1^u, X_2^u, \ldots, X_k^u \in D_{u2}$, where $X_i^u$ is an unlabeled fundus image and $k$ is the total number of images
  •       Step 2.1 Fine-tune the ImageNet model for glaucoma detection:
  •              Load the model trained at Step 1.4
  •              Compile the model with SGD, loss function CE, classification metrics (val_acc, val_loss)
  •              Fine-tune the trained model with the training data $D_{tr}$ // for glaucoma detection
  •                 for $e = 1, \ldots,$ num_epoch
  •                    update model weights $\theta$ using SGD
  •                        $\theta_e = \theta_{e-1} - \eta \nabla_\theta L(\theta_{e-1})$, where $\eta$ is the learning rate
  •                end for
  •               Extract the optimal weights $\theta^*$
  •                      $\theta^* = \theta_{e^*}$, where $e^* = \arg\max_e \mathrm{accuracy}_{val}^{(e)}$
  •               Test the trained model with the testing data $D_{te}$
  •                   $y_{predict} = f(D_{te}; \theta^*)$
  •  Step 3: Pseudo-label prediction using the ImageNet model trained at Step 2.1
  •                  Trained ImageNet model ← $D_{u2}$
  •                  $y_{pseudolabel} = f(D_{u2}; \theta^*)$
Downstream Task—Glaucoma Detection
In the downstream task, the trained exemplar CNN model is connected to fully connected and softmax layers. It is then fine-tuned using the training data from the labeled dataset ($D_L$) for the downstream task of glaucoma detection, as detailed in Step 2 of Algorithm 2 and illustrated in Figure 4. The fine-tuning process involves minimizing the CE loss function and optimizing the weights of the layers using SGD. The optimal weights are then saved, as explained in Step 2.1 of Algorithm 2, and the model is tested with the testing data $D_{te}$. Training the glaucoma classifier in the downstream task enables the generation of pseudo-labels for the unlabeled fundus images $D_{u2}$ in the next step, which is explained in Step 3 of Algorithm 2.
The approach employs pre-trained ImageNet models as feature extractors and applies transfer learning to glaucoma detection, with the model initially trained on 80% of the labeled data $D_L$. This is expected to improve the model’s ability to predict glaucoma accurately on unseen data. Pseudo-labels are generated for 40% ($D_{u2}$) of the unlabeled fundus images using this self-supervised learning 1 approach. These pseudo-labels are considered “naïve” because they are generated without explicit supervision or true labels for the unlabeled data. A confidence score of 0.80 is set as the threshold for accepting pseudo-labels, to prevent incorrect labels from affecting the model.
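The 0.80 confidence threshold can be illustrated with a small Python sketch. It assumes that the fine-tuned classifier outputs softmax probabilities over two classes, that class index 0 means normal and 1 means glaucoma, and that a prediction is kept when its confidence is greater than or equal to the threshold; the text does not state these details, and `select_pseudo_labels` is a hypothetical helper name.

```python
def select_pseudo_labels(probabilities, threshold=0.80):
    """Keep only predictions whose top softmax probability reaches the threshold.

    probabilities: list of (p_normal, p_glaucoma) pairs, one per unlabeled image.
    Returns a list of (image_index, pseudo_label, confidence) tuples, where
    pseudo_label is 0 (normal) or 1 (glaucoma) under this sketch's convention.
    """
    selected = []
    for idx, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((idx, probs.index(confidence), confidence))
    return selected


if __name__ == "__main__":
    # Hypothetical softmax outputs for four unlabeled images.
    outputs = [(0.95, 0.05), (0.40, 0.60), (0.10, 0.90), (0.79, 0.21)]
    for idx, label, conf in select_pseudo_labels(outputs):
        print(f"image {idx}: pseudo-label {label} (confidence {conf:.2f})")
```

Images whose top probability falls below the threshold are simply left out of the pseudo-labeled set in this sketch.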

4.3. Glaucoma Detection with Expanded Fundus Dataset in the Proposed Method

Figure 5 demonstrates glaucoma detection by the proposed method using an enhanced fundus dataset $E_{labeled}(D_L, D_{PL})$, which is Stage 4 of the proposed method in Figure 2; the process is described in Algorithm 3. The pseudo-labeled fundus images $D_{PL}$, generated by the model trained with self-supervision 1 in Section 4.2, are merged with the labeled fundus images $D_L$ to create the enhanced dataset $E_{labeled}$ used for training, as described in Step 1 of Algorithm 3. The encoded features $H_E$ of the expanded fundus dataset $E_{labeled}$ are extracted using an AE model trained with self-supervision 2, corresponding to Stage 3 in Figure 2.
The encoded features H E are subsequently utilized to train the SVM classifier for glaucoma detection, explained in Step 3 of Algorithm 3. The hyperparameters C, gamma, and the kernel function of the SVM classifier are optimized using the grid search method during training, which results in the complete proposed method for glaucoma detection. The process of self-supervision using a CAE encoder with an SVM classifier for glaucoma detection is explained in self-supervision 2.

4.3.1. Self-Supervision 2

Figure 5 illustrates the semi-supervised framework for glaucoma diagnosis, which also contributes to self-supervision 2 in the proposed method. This semi-supervised process is described in Steps 1 and 2 of Algorithm 3. The CAE model is used in this framework as a feature extractor through the pretext task of unsupervised learning; different models, such as a vanilla Auto-Encoder (vanilla AE) and Denoising Auto-Encoder (DAE), in addition to CAE, are also tested for performance analysis in the proposed work. The encoders trained by a model through the pretext task of reconstructing fundus images are employed for the downstream task of glaucoma detection. The features of labeled fundus images are extracted from the trained encoder and are used to train an SVM classifier for glaucoma detection, as illustrated in Figure 5. The CAE-SVM framework is referred to as a core architecture for glaucoma classification in the proposed work, as depicted in Stage 3 in Figure 2. These pretext and downstream tasks are detailed in this section.
Pretext Task—Reconstruction of Fundus Images
The unsupervised learning, depicted as representation learning in the proposed method, is described in Step 1 of Algorithm 3. The framework consists of an encoder composed of three sets of a convolutional layer followed by a max-pooling layer, as shown in Figure 5, culminating in a global average pooling layer that produces the latent encoded features of the fundus image, represented by $H$ in Algorithm 3. These encoded features are then fed into the decoder unit, which consists of three matching convolutional layers and three up-sampling layers, concluding with an output layer that reconstructs the fundus image, denoted as $X_r$ in Algorithm 3. The reconstruction error is defined as the MSE and denoted as $E$ in Algorithm 3; it is the error function minimized over repeated training of the CAE. The fundus images of $D_{u1}$, resized to 128 × 128 pixels, are fed as input to the auto-encoder used in this work. During the AE’s training, the Adam optimizer is used with a learning rate of 0.001, 50 epochs, and a batch size of 128 to minimize the reconstruction error. The supervised classification with the CAE model for the downstream task of glaucoma detection is detailed under the downstream task.
Algorithm 3: Glaucoma detection with enhanced fundus dataset
Input: Enhanced fundus dataset $E_{labeled}$. Output: Detection of glaucoma labels
  •  Step 1: Pretext task (representation learning for reconstruction of fundus images)
  •        Input: $X_1^u, X_2^u, \ldots, X_k^u \in D_{u1}$, where $D_{u1}$ is the set of unlabeled fundus images
  •        Step 1.1 Encoding:
  •           Convert the input fundus image $X_i^u$ into the latent representation $H$ by
  •                       $H = f(X_i^u) = \alpha(X_i^u w + b)$, $i = 1, 2, 3, \ldots, k$
  •           where $\alpha$ denotes a non-linear activation function, $w$ denotes the weights, and $b$ represents the bias
  •        Step 1.2 Decoding:
  •           Reconstruct the input fundus image from the latent representation $H$ by
  •                       $X_r = \tilde{f}(H) = \alpha(\tilde{w} \cdot H + \tilde{b})$
  •        Step 1.3 Calculation of reconstruction error:
  •            Compute the mean square error norm $E(X_i^u, X_r) = \lVert X_i^u - X_r \rVert^2$, where $E$ is the reconstruction error cost function and $X_r$ is the reconstructed fundus image
  •            Update the weights and bias by backpropagation and repeat until $E \approx 0$
  •  Step 2: Downstream task (supervised learning of CAE for glaucoma detection)
  •      Input: $X_1^l, X_2^l, \ldots, X_n^l$ and $y_1^l, y_2^l, \ldots, y_n^l$; $X_i^l, y_i^l \in D_L$,
  •                      where $X_i^l$ is a labeled fundus image, $y_i^l$ is the corresponding label, and $n$ is the total number of images
  •               Training data $D_{tr}$ and testing data $D_{te}$; $D_{tr}, D_{te} \subset D_L$
  •           Step 2.1: Classification (SVM classifier)
  •               $D_{tr} \xrightarrow{encoding} H_1$, where $H_1$ is the latent representation of the labeled fundus images
  •              Normalize $H_1$
  •            Training phase:
  •               For each combination of hyperparameters C, gamma, and kernel function (initialize the hyperparameters)
  •                  Train the SVM classifier using the current hyperparameters
  •                            $H_1 \xrightarrow{SVM} y_{predict}$
  •                  Evaluate the classifier on validation data and update the best C, gamma, and kernel function using grid-search tuning
  •                  Return the best model with its corresponding hyperparameters
  •              Testing phase:
  •                  Test CAE-SVM (core architecture) with the testing set $D_{te}$
  •              Evaluate classification metrics of the core architecture
  •                  Acc, F1, Prec, Recall, Sen, Spec, and AUC
  •  Step 3: Glaucoma detection with an expanded set
  •            Input: Expanded fundus images → $X_i^E = \{X_1^l, X_2^l, \ldots, X_n^l, X_1^u, X_2^u, \ldots, X_k^u\}$ and
  •                       $y_i^E = \{y_1^l, y_2^l, \ldots, y_n^l, y_1^u, y_2^u, \ldots, y_k^u\}$, where $y_i^u \in y_{pseudolabel}$
  •                       $X_i^E, y_i^E \in E_{labeled}$, where $i = 1, 2, 3, \ldots, n + k$
  •                       Training data $E_{labeled}$ and testing data $D_{te}$; $D_{te} \subset D_L$
  •            Step 3.1: Classification (SVM classifier)
  •                   $E_{labeled} \xrightarrow{encoding} H_E$, where $H_E$ is the latent representation of the expanded fundus images
  •                   Normalize $H_E$
  •             Training phase:
  •                 For each combination of hyperparameters C, gamma, and kernel function (initialize the hyperparameters)
  •                        Train the SVM classifier using the current hyperparameters
  •                             $H_E \xrightarrow{SVM} y_{predict}$
  •                        Evaluate the classifier on validation data and update the best C, gamma, and kernel function using grid-search tuning
  •                  Return the best model with its corresponding hyperparameters
  •             Testing phase:
  •                  Test CAE-SVM with the testing set $D_{te}$
  •             Evaluate classification metrics of the proposed model
  •                  Acc, F1, Prec, Recall, Sen, Spec, and AUC
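The grid-search loop in the training phases of Algorithm 3 can be sketched generically as follows. This is a library-free illustration rather than the authors' implementation: the candidate values in `GRID` are hypothetical (the tuned values are reported in Table 6), and `evaluate` stands in for "train an SVM with these hyperparameters on the normalized features and return its validation score", which in practice would be delegated to an SVM library.

```python
from itertools import product

# Hypothetical candidate values; the source does not list the searched grid.
GRID = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["linear", "rbf", "poly"],
}


def grid_search(evaluate, grid):
    """Exhaustively try every hyperparameter combination.

    evaluate: callable taking a dict of hyperparameters and returning a
              validation score (higher is better).
    Returns (best_params, best_score).
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    # Stand-in scoring function, used only to make the sketch executable.
    def toy_validation_score(params):
        bonus = 0.1 if params["kernel"] == "rbf" else 0.0
        return bonus - abs(params["C"] - 10) / 100 - abs(params["gamma"] - 0.01)

    print(grid_search(toy_validation_score, GRID))
```

Because the search is exhaustive, its cost grows with the product of the candidate counts (4 × 4 × 3 = 48 evaluations for the grid above).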
Downstream Task—Glaucoma Detection
Step 2 of Algorithm 3 explains the supervised learning process of the CAE for the downstream task of glaucoma detection, also shown in Figure 5. The weights of the encoder trained in the pretext task (Step 1) are frozen, and the encoder is transferred to extract the features $H_1$ from the labeled dataset $D_L$, as indicated in Step 2 of Algorithm 3. The output of the trained encoder is connected to an ML classifier, here an SVM, which is then trained for glaucoma detection, as depicted in Figure 5 and explained in Step 2.1 of Algorithm 3. Figure 6 shows the full CAE-SVM architecture, detailing layers, nodes, and key parameters. The CAE encoder has three convolutional layers with 128, 64, and 32 filters, each with a 3 × 3 kernel, stride 1, and ReLU activation, followed by 2 × 2 max-pooling. A global average pooling layer reduces the spatial dimensions, producing the latent representation that is used as input for the SVM classifier. The hyperparameters of the SVM classifier, namely C, gamma, and the kernel function, are tuned using the grid search method described in the training phase of Algorithm 3. Table 6 displays the optimized SVM hyperparameters for the various AEs used in the proposed method when trained on the expanded sets built from small and large pseudo-labeled fundus datasets.
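As a worked check of the encoder dimensions quoted above, the following Python sketch traces the feature-map size through the three convolution and pooling stages. Two details are assumptions, since the text does not state them: the convolutions are taken to use "same" padding (so only pooling changes the spatial size), and the input is taken to have 3 color channels. The function name `encoder_shapes` is hypothetical.

```python
def encoder_shapes(input_hw=128, in_channels=3, filters=(128, 64, 32), kernel=3, pool=2):
    """Trace output shapes and parameter counts of the CAE encoder.

    Assumes stride-1 convolutions with "same" padding, so the spatial size
    is unchanged by each convolution and halved by each 2 x 2 max-pooling.
    """
    layers = []
    hw, channels, total_params = input_hw, in_channels, 0
    for f in filters:
        params = kernel * kernel * channels * f + f  # weights plus biases
        total_params += params
        hw //= pool
        channels = f
        layers.append({"filters": f, "output": (hw, hw, f), "params": params})
    latent_dim = channels  # global average pooling keeps one value per channel
    return layers, latent_dim, total_params


if __name__ == "__main__":
    layers, latent_dim, total_params = encoder_shapes()
    for layer in layers:
        print(layer)
    print("latent dimension:", latent_dim, "| encoder parameters:", total_params)
```

Under these assumptions a 128 × 128 input shrinks to 64 × 64, 32 × 32, and 16 × 16, and global average pooling yields a 32-dimensional feature vector per image for the SVM.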
In the testing phase, the core architecture is tested with the testing dataset $D_{te}$, and the classification metrics of the core architecture in the proposed method are evaluated. For comparison, this framework is also built with other AE models, namely the vanilla AE and DAE, which are trained, tested, and assessed for glaucoma detection. Step 3 of Algorithm 3 details Stage 4 of Figure 2 for glaucoma detection. The encoded features $H_E$ of the expanded fundus dataset $E_{labeled}$ are extracted from the CAE model in self-supervision 2 and used to train the SVM classifier for glaucoma detection. During testing, the $D_{te}$ fundus image dataset is applied to the CAE-SVM model, which has been trained on an expanded fundus image dataset. The model then predicts labels, and various classification metrics are calculated, as detailed in Step 3 of Algorithm 3. These metrics include accuracy, F1 score, precision, recall, sensitivity, specificity, and AUC. A summary of the training and evaluation setups for all experiments related to the proposed method is provided in the Supplementary Material (Table S6).

5. Experimental Results and Discussion

The proposed work introduces a twin self-supervised learning framework for glaucoma detection. This framework utilizes a small unlabeled dataset ($S_1$) and a large unlabeled fundus dataset ($L_1$), from which pseudo-labels are generated using an exemplar CNN method. The system is designed to train an AE model in conjunction with an SVM classifier for effective glaucoma detection. The proposed model has been evaluated using various AEs trained on expanded fundus datasets, created using pseudo-labeled data of varying sizes. The pseudo-labels of these datasets were generated through transfer learning and exemplar CNN self-supervised learning, utilizing different pre-trained models. The experimental results on an internal (same-source) dataset, which combines fundus images from the ORIGA and Drishti-GS datasets, are detailed and analyzed in Sections 5.1 to 5.4. A comparison of the core architecture, the proposed method, and state-of-the-art approaches is explored and discussed in Section 5.5. Toward the end of Section 5.5, the results of internal (same-source) dataset runs on ORIGA/Drishti-GS are also discussed. Lastly, evidence of external generalization is presented and analyzed using both the CRFO v4 dataset (external, based on a leave-one-dataset-out protocol) and the Papilla dataset (external, fully unseen).

5.1. Core Architecture Results for Glaucoma Detection

The core architecture under consideration for glaucoma detection is a trained AE model, as shown in Figure 2, which results from representation learning through self-supervised learning 2. Initially, the AE was trained in an unsupervised manner using 40% of the available fundus images ($D_{u1}$) and then compared with the advancements proposed by this work. The AE is used as a feature extractor for glaucoma detection in conjunction with an SVM classifier, as given in Figure 2. Three types of AE models, a vanilla AE with a feed-forward network, a CAE, and a DAE, are utilized in this manner for glaucoma detection, and their performances have been compared to design a better model. Table 7 presents the performance metrics of the three AE models: accuracy, F1 score, precision, recall, sensitivity, specificity, and AUC. The performance metrics are defined as
$$\mathrm{Accuracy\ (Acc)} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\mathrm{Precision\ (Prec)} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall\ (Recall)} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\ score\ (F1)} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{Sensitivity\ (Sen)} = \frac{TP}{TP + FN}$$
$$\mathrm{Specificity\ (Spec)} = \frac{TN}{TN + FP}$$
where $TP$, $FP$, $TN$, and $FN$ are the number of true positives, false positives, true negatives, and false negatives, respectively.
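The metric definitions above translate directly into code. The following Python sketch computes them from the four confusion-matrix counts; the function name and the example counts are hypothetical and are not taken from the reported results. AUC is omitted because it requires ranked scores rather than counts, and no guard against empty denominators is included.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics defined above from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "sensitivity": recall,  # same formula as recall in the definitions above
        "specificity": tn / (tn + fp),
    }


if __name__ == "__main__":
    # Hypothetical counts: 100 glaucoma and 100 normal test images.
    for name, value in classification_metrics(tp=85, fp=5, tn=95, fn=15).items():
        print(f"{name}: {value:.3f}")
```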
The core architecture serves as a benchmark for evaluating and comparing subsequent models or enhancements. It contains an AE model, which is itself a kind of self-supervised learning framework. Among the AE models tested on the internal (same-source) dataset formed by combining fundus images from the ORIGA and Drishti-GS datasets, the DAE performs best, achieving 93% accuracy, 93% recall, 93% precision, and 0.94 AUC. This is because the DAE can develop robust, general features by learning to reconstruct clean data from corrupted inputs. Adding noise during training helps the model focus on the essential data structure and ignore irrelevant variations, leading to better generalization and downstream task performance. The CAE achieved the second-highest accuracy of 85%, with a recall of 86%, precision of 85%, and an AUC of 0.88, while the vanilla AE produced an accuracy of 80%, a recall of 83%, precision of 80%, and an AUC of 0.84. In Figure 7, the training and validation loss plots of the different AEs illustrate their performance in terms of mean square error. The DAE stands out with a significantly lower validation error (<0.01) than the other models, indicating its superior ability to capture the features needed for fundus image reconstruction. The DAE achieved the highest values for all other metrics, but its sensitivity of 85% ranked second among the three models. The vanilla AE had the highest sensitivity at 90%, while the CAE had the lowest at 71%. The DAE with the SVM classifier is highly successful in identifying the normal class because of its high specificity and precision. However, it faces challenges in detecting some glaucoma cases, most likely due to the limited data in the training set. The observations from Table 7 suggest that an AE model with an SVM classifier can reach performance metrics above 90%, and that this can be accomplished by expanding the training data with pseudo-labels generated for the unlabeled fundus images.
To validate these observations, a second self-supervised method based on exemplar CNN feature learning is used to generate naïve pseudo-labels with a pre-trained model. A model is first trained on the fundus dataset $D_{u3}$ to classify patches according to their geometric transformations and is then fine-tuned for glaucoma diagnosis. The unlabeled images $D_{u2}$ are passed through this model to create pseudo-labels for small ($S_1$) and large ($L_1$) datasets, so that performance can be tested across different data volumes for clinical or practical use. These pseudo-labeled images (both small and large sets) are combined with the labeled data to form the expanded datasets ($E_{labeled}$) for further analysis, with AE-based models fine-tuned for glaucoma diagnosis. Six pre-trained models for pseudo-label generation (VGG16, ResNet-50, MobileNet, Inception-v2, Xception-v3, and DenseNet-101) are compared in terms of performance, demonstrating the effectiveness of twin self-supervised learning, especially with limited labels. Tables 8 to 13 show the contribution of the exemplar CNN approach for generating pseudo-labels in the AE-based glaucoma diagnosis model. Pseudo-labels are generated under two conditions. Case A: Pseudo-labels are created by fine-tuning pre-trained models for glaucoma detection without employing an exemplar CNN approach. Case B: Pseudo-labels are produced by utilizing the exemplar CNN (in self-supervised learning 1) before fine-tuning the pre-trained models for glaucoma detection.

5.2. Performance of Vanilla AE with Pseudo-Labeled Fundus Images $D_{PL}$ and Varying Dataset Size

The features of the expanded fundus images (from the labeled fundus images $D_L$ and a range of both small and large pseudo-labeled fundus images) are extracted using the trained encoder of the vanilla AE model. These extracted features are then used to train an SVM classifier for glaucoma detection. The vanilla AE-SVM model was evaluated on internal (same-source) test data derived from the combined fundus images of the ORIGA and Drishti-GS datasets. Table 8 presents the performance of a vanilla AE trained on a small pseudo-labeled expanded dataset (generated in Case A and Case B) using different pre-trained models. In Case A, fine-tuning without the exemplar CNN, VGG16 and Xception-v3 yield similarly high metrics for the vanilla AE, with an accuracy of 93%, recall of 94%, precision of 93%, and an AUC of 0.94. DenseNet-101 follows with the next highest values, achieving an accuracy of 85%, recall of 80%, precision of 90%, and an AUC of 0.80. All the other models show very low values across all metrics. The sensitivity of VGG16 and Xception-v3 is 96% and their specificity is 92%, whereas all other models have a specificity of 100% and very low sensitivity. This indicates that the models other than VGG16 and Xception-v3 are unable to detect glaucoma cases effectively, while VGG16 and Xception-v3 can identify them accurately. In Case B, with the exemplar CNN method, the MobileNet model reaches 100% accuracy, recall, precision, sensitivity, and specificity and an AUC of 1.00, which may indicate overfitting. VGG16 and DenseNet-101 achieve an accuracy of 98%, AUC of 0.99, recall of 99%, and sensitivity of 100%. The next highest values are achieved by Xception-v3, with an accuracy and recall of 97%, sensitivity of 96%, and AUC of 0.97.
Overall, Table 8 shows that in both Case A (without the exemplar CNN) and Case B (with the exemplar CNN), the vanilla AE performs well in identifying glaucoma and non-glaucoma cases, especially when pseudo-labels generated with the VGG16 pre-trained model using the exemplar CNN method are included.
Table 9 presents a comparison of performance metrics obtained from a vanilla AE trained on a large pseudo-labeled expanded dataset (generated in Case A and Case B) using different pre-trained models. The vanilla AE trained on an expanded dataset generated by the ResNet-50 model excels in both Case A (without exemplar CNN) and Case B (with exemplar CNN), with accuracy, recall, precision, sensitivity, and specificity of 100% and an AUC of 1.00. In Case A, VGG16 follows with an accuracy of 95%, a recall of 96%, a precision of 94%, an AUC of 0.96, a sensitivity of 100%, and a specificity of 92%, while the Inception-v2 model achieves an accuracy of 95%, a recall of 93%, a precision of 96%, an AUC of 0.93, and a sensitivity of 87%. The remaining models, MobileNet, Xception-v3, and DenseNet-101, have lower metric values. However, using the exemplar CNN method (Case B), the Inception-v2 model achieved the second-highest metric values, with accuracy and recall of 98%, precision of 99%, AUC of 0.98, and sensitivity of 96%. Although models like VGG16 and DenseNet-101 demonstrate 100% sensitivity, they show significantly lower scores in other metrics such as accuracy, precision, recall, specificity, and AUC, which hinders their ability to detect normal cases effectively.
Other models, such as Xception-v3 and MobileNet, do not enhance performance in detecting glaucoma with the addition of the large pseudo-label dataset when using the vanilla AE. Overall, from Table 8, it is evident that VGG16 performs well in identifying glaucoma cases in both Case A (without the exemplar CNN) and Case B (with exemplar CNN). In Case B (with exemplar CNN), the performance of vanilla AE when trained with an expanded dataset generated by VGG16 and DenseNet-101 does well in identifying glaucoma cases and does moderately in identifying non-glaucoma (normal) cases. However, from Table 9, it is observed that vanilla AE with a large, expanded dataset by both models fails to accurately identify non-glaucoma cases in Case B (with exemplar CNN). The Inception-v2 pre-trained model effectively identifies non-glaucoma (normal) cases in both Case A and Case B. However, it demonstrates moderate performance in identifying glaucoma cases in Case B and fails to do so in Case A. ResNet-50 demonstrates 100% in all metrics for Case A and Case B, which may indicate an overfitting issue.
Figure 8 shows the ROC plots of the vanilla AE model trained on small and large pseudo-labeled expanded datasets based on different pre-trained models for Case A (without exemplar CNN) and Case B (with exemplar CNN). The AUC measures the model’s ability to distinguish glaucoma from non-glaucoma cases. Clinically, a high AUC indicates strong diagnostic ability, with true glaucoma cases identified and false positives reduced, which is vital for early intervention and preventing vision loss. Figure 8a,c show the ROC plots for Case A using the small and large expanded datasets. Among the non-overfitting cases, the highest AUC of 0.96 was obtained with the large pseudo-labeled expanded dataset generated by VGG16, whereas the small pseudo-labeled expanded dataset produced a maximum AUC of 0.94 for both VGG16 and Xception-v3. This indicates that the vanilla AE model discriminates better between glaucoma and normal cases with the large pseudo-labeled expanded set than with the smaller one. The ROC plots for Case B, presented in Figure 8b,d, show that MobileNet and ResNet-50 achieve a maximum AUC of 1.00; however, given the other metrics in Table 8 and Table 9, the pseudo-labels produced by these models may cause the vanilla AE model to overfit. The second highest AUC for Case B was 0.99 with the small pseudo-labeled expanded datasets generated by VGG16 and DenseNet-101, followed by 0.98 with the large pseudo-labeled expanded dataset generated by Inception-v2. Figure 8b shows that the small expanded dataset provides better AUC values for most pre-trained models than the large pseudo-labeled expanded dataset in Figure 8d. When comparing the addition of small or large pseudo-labeled data to the labeled dataset $D_L$ for training the vanilla AE model (Table 8, Table 9, and Figure 8), pseudo-labeled data generated from the VGG16, DenseNet-101, and Inception-v2 pre-trained models were found to improve glaucoma detection.
In particular, pseudo-labels generated by the exemplar CNN method with these pre-trained models contribute much higher performance to the vanilla AE, improving glaucoma detection and the ability to detect both glaucoma and normal cases. Figure 8d shows that, in Case B (exemplar CNN), the vanilla AE trained on the large pseudo-labeled expanded data performs worse than the one trained on the small pseudo-labeled expanded fundus data, which could be due to limitations in the structural design of the vanilla AE model that lead to inconsistent performance.
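The AUC values compared above can be computed directly from classifier outputs. The following is a minimal sketch rather than the authors' evaluation code: it uses the rank-based (Mann-Whitney) definition of AUC, that is, the probability that a randomly chosen glaucoma image receives a higher score than a randomly chosen normal image, with ties counted as one half. The labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: P(score of a positive > score of a negative).

    labels: iterable of 0 (normal) or 1 (glaucoma).
    scores: iterable of classifier outputs, higher meaning more glaucoma-like.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]
soft_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]   # continuous scores
hard_preds = [1, 1, 0, 1, 0, 0]                # thresholded 0/1 predictions

auc_soft = roc_auc(labels, soft_scores)
auc_hard = roc_auc(labels, hard_preds)
print(round(auc_soft, 3), round(auc_hard, 3))
```

When the inputs are hard 0/1 predictions, this quantity reduces to the mean of sensitivity and specificity. Several AUC values quoted in this section coincide with that mean (for example, 0.94 with 96% sensitivity and 92% specificity); the text does not state how the AUC was computed, so this is only an observation.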

5.3. Performance of CAE with Pseudo-Labeled Fundus Images $D_{PL}$ and Varying Dataset Size

The features of the expanded fundus images (from the labeled fundus images $D_L$ and a range of both small and large pseudo-labeled fundus images) are extracted using the trained encoder of the CAE model. These extracted features are then used to train an SVM classifier for glaucoma detection. The CAE-SVM model was evaluated using internal test data (same source as training) from the combined fundus images of the ORIGA and Drishti-GS datasets.
Table 10 details the performance of a CAE trained on a small pseudo-labeled expanded dataset (generated in Case A and Case B) using various pre-trained models. Without the exemplar CNN method (Case A), CAE models with pseudo-labels produced by the VGG16 and MobileNet models both achieve a high accuracy of 93% and precision of 93%, but differ in other metrics, with recall of 95% and 93% and AUC of 0.95 and 0.93, respectively. The Inception-v2 model achieved the second-highest accuracy of 85%, while the third-highest recall and AUC were obtained from the DenseNet-101 model, with values of 84% and 0.84, respectively. All other models perform worse than these models. The sensitivity for VGG16 and DenseNet-101 is 100%, while their specificity is lower. CAE models trained with pseudo-labeled data from some other models have a specificity of 95%, but their sensitivity is lower. This shows that pseudo-labeled data from the VGG16, MobileNet, and DenseNet-101 models helps in identifying glaucoma cases, while the other models fail. However, VGG16 fails to detect normal cases as effectively as the DenseNet-101 model, as indicated by a low specificity. In contrast, in Case B, the exemplar CNN method enhances CAE performance for all models except VGG16. Specifically, the ResNet-50 and Inception-v2 models deliver the highest and most improved results compared to those without the exemplar CNN method, each with an accuracy of 95%, recall of 96%, precision of 94%, and AUC of 0.96. The sensitivity of all these models is 100%, and the specificity is 92% for both ResNet-50 and Inception-v2, while the other models have lower specificity values. In summary, comparing Case A (without the exemplar CNN) and Case B (with the exemplar CNN), the pre-trained models achieve good classification metrics when the pseudo-labels generated by the exemplar CNN method (Case B) are included in the training.
Both ResNet-50 and Inception-v2 exhibit superior performance in glaucoma diagnosis, as well as effectively identifying non-glaucoma (normal) cases in Case B.
Table 11 highlights the performance of the CAE model trained on a large pseudo-labeled expanded dataset (generated in Case A and Case B) using different pre-trained models. The VGG16 and Inception-v2 models perform best in the CAE model without the exemplar CNN method (Case A), both with an accuracy of 95% and precision of 94%, and with recall of 95% and 96%, sensitivity of 96% and 100%, specificity of 95% and 92%, and AUC of 0.95 and 0.96, respectively. The Xception-v3 model gives the next highest values, with an accuracy of 93%, recall of 95%, precision of 93%, sensitivity of 100%, specificity of 89%, and AUC of 0.95. All other models yield lower values in all metrics. However, with the exemplar CNN method (Case B), the CAE model with pseudo-labeled data from Xception-v3 performs best, with the highest accuracy of 98%, recall of 98%, precision of 99%, sensitivity of 96%, specificity of 100%, and AUC of 0.98. ResNet-50 follows Inception-v2 closely with an accuracy of 97%, recall of 97%, precision of 96%, sensitivity of 96%, specificity of 100%, and AUC of 0.97. Once again (Table 10 and Table 11), the ResNet-50, Inception-v2, Xception-v3, MobileNet, and DenseNet-101 models all show improved glaucoma detection by the CAE with the large pseudo-labeled dataset, whereas VGG16 does not improve. Overall, the large pseudo-labeled expanded dataset with the exemplar CNN (Case B) significantly outperforms the approach without the exemplar CNN (Case A).
Figure 9 shows the ROC plots of the CAE model trained on small and large pseudo-labeled expanded datasets based on different pre-trained models for Case A (without exemplar CNN) and Case B (with exemplar CNN). Figure 9a,c display the ROC plots of the CAE model trained with the small and large pseudo-labeled expanded datasets in Case A. The CAE model trained on the larger pseudo-labeled expanded dataset, generated using the Inception-v2 model, achieved a maximum AUC of 0.96, indicating better overall performance in glaucoma detection. Figure 9b,d illustrate the ROC plots of the CAE model in Case B, trained on the small and large pseudo-labeled expanded datasets. The maximum AUC of 0.98 was achieved using the Xception-v3 model with the large pseudo-labeled expanded dataset. This AUC value indicates that the CAE model predicts both glaucoma and normal cases very well. Figure 9 shows that the improvement of Case B (with exemplar CNN) over Case A (without exemplar CNN) is greater with the large pseudo-labeled expanded dataset than with the small one. Figure 9c,d indicate that, with the large pseudo-labeled expanded dataset, the AUC values in Case A and Case B average above 0.92, ranging from 0.82 to 0.98. Figure 9b,d show that, with the pseudo-labeled expanded datasets from the exemplar CNN (Case B), the AUC values for the small and large cases average above 0.94. In Case B (with exemplar CNN), the CAE models trained on the larger expanded dataset differentiate between glaucoma and normal cases better than those trained on the smaller one. When comparing the performance of the CAE model using small and large pseudo-labeled expanded datasets from different pre-trained models (Table 10, Table 11, and Figure 9), it is evident that the pseudo-labels produced using the exemplar CNN method contribute to the highest performance in glaucoma detection.
Incorporating a substantial number of unlabeled fundus images with pseudo-labels into the CAE training set enhances its performance in glaucoma detection. Specifically, the contribution of pseudo-labels generated from Xception-v3, Inception-v2, and ResNet-50 models (for Case B with a large pseudo-labeled expanded fundus set) to CAE for glaucoma detection is greater in all metrics.

5.4. Performance of DAE with Pseudo-Labeled Fundus Images $D_{PL}$ and Varying Dataset Size

The features of the expanded fundus images (from the labeled fundus images $D_L$ and a range of both small and large pseudo-labeled fundus images) are extracted using the trained encoder of the DAE model. These extracted features are then used to train an SVM classifier for glaucoma detection. The DAE-SVM model was evaluated on internal test data from the combined fundus images of the ORIGA and Drishti-GS datasets, the same source on which the training was carried out.
Table 12 displays the performance of the DAE model trained on a small pseudo-labeled expanded dataset (generated in Case A and Case B) using various pre-trained models. In Case A (without the exemplar CNN), Xception-v3 and MobileNet exhibit similar performance and obtain the highest values, with an accuracy of 95%, precision of 96%, recall of 93%, AUC of 0.93, specificity of 93%, and sensitivity of 87%. Other models performed worse than these models. The sensitivity of the DAE model in Case A is low compared to the vanilla AE and CAE models, indicating that the pseudo-labeled data generated from the pre-trained models contribute less to identifying glaucoma cases with the DAE than in Section 5.2 and Section 5.3. Under the exemplar CNN method (Case B), the MobileNet and DenseNet-101 models yield the same values in all metrics except accuracy (specificity of 100%, precision of 97%, recall of 96%, AUC of 0.96, and sensitivity of 91%). MobileNet achieves the higher accuracy of 97%, and DenseNet-101 gives an accuracy of 92%. The next highest values were obtained with pseudo-labeled data generated from Xception-v3, with an accuracy of 95%, recall of 93%, precision of 96%, AUC of 0.93, sensitivity of 87%, and specificity of 100%. The specificity of these three models shows that they contribute more to identifying normal cases than glaucoma cases. With the exemplar CNN, performance decreased for the VGG16 model, while the other models show improved DAE results in glaucoma diagnosis. MobileNet yields a significant improvement in glaucoma detection, while Xception-v3’s metrics remain almost the same in Case B (with exemplar CNN) as in Case A (without exemplar CNN). Overall, the DAE model does not produce satisfactory results in identifying glaucoma cases with a small-sized expanded dataset, which could indicate an overfitting issue with the model.
Table 13 highlights the performance of the DAE model trained on a large pseudo-labeled expanded dataset (generated in Case A and Case B) using different pre-trained models. Without the exemplar CNN method (Case A), VGG16 contributes the highest performance, with an accuracy of 93%, recall of 91%, precision of 95%, AUC of 0.91, sensitivity of 83%, and specificity of 100%. Although the specificity for models like Inception-v2 and DenseNet-101 is quite poor, the sensitivity of the DAE trained with their pseudo-labeled data is 100%. This result indicates that the pseudo-labels produced by these two pre-trained models help the DAE model identify glaucoma cases better than normal cases. Overall, Table 13 shows that the exemplar CNN (Case B) with the large pseudo-labeled data generated from the Xception-v3 and Inception-v2 models significantly improves the performance of the DAE model compared to Case A (without exemplar CNN). Comparing the results in Table 12 and Table 13, pseudo-labels from the pre-trained models without the exemplar CNN method (Case A) help the DAE model identify glaucoma significantly better with the large expanded dataset. The DAE model with the exemplar CNN method (Case B) and large pseudo-labeled data produced by Inception-v2 provides the highest performance among the models, with an accuracy of 95%, recall of 94%, and AUC of 0.94. The DAE model with pseudo-labeled data from the Xception-v3 model produces the second highest metric values, with an accuracy of 93% and AUC of 0.91. The VGG16, DenseNet-101, MobileNet, and ResNet-50 models do not exhibit performance improvements in all metrics. Only the sensitivity for VGG16 and DenseNet-101 improves in Case B, reaching 100%.
Figure 10 shows the ROC plots of the DAE model trained on small and large pseudo-labeled expanded datasets (generated in Case A and Case B) based on different pre-trained models. Figure 10a,c illustrate the ROC plots for the DAE model trained on the small and large pseudo-labeled expanded datasets in Case A. The highest AUC of 0.93 was achieved by the DAE model trained with the smaller pseudo-labeled expanded dataset generated by the MobileNet and Xception-v3 models. A few ROC plots in Figure 10a,c show significantly lower AUC values for the DAE models in Case A with both the small and large pseudo-labeled expanded datasets. Figure 10b,d display the ROC curves of the DAE model in Case B, using the small and large pseudo-labeled expanded datasets. The DAE model with the small pseudo-labeled expanded dataset generated by the DenseNet-101 and MobileNet models achieved a maximum AUC of 0.96 with the exemplar CNN, which is higher than that in Case A. Across the various pre-trained models, the DAE model in Case B achieved an average AUC of 0.845, ranging from 0.68 to 0.96. This indicates that the DAE model is less effective at distinguishing between glaucoma and normal cases than the vanilla AE and CAE models, as illustrated in Figure 8 and Figure 9.
When comparing the performance of the DAE model (Table 12, Table 13, and Figure 10) for Case A (without exemplar CNN) and Case B (with the exemplar CNN) in terms of sensitivity and specificity, testing the DAE model trained with small or large pseudo-labeled data leads to two distinct outcomes. In some cases, it identifies glaucoma with 100% accuracy but fails to detect normal cases, while in other cases it shows the opposite behavior. As mentioned in Section 5.1, the DAE yields better results than the other AE models in all metrics. Still, it does not reach higher performance when the pseudo-labeled dataset $D_{PL}$ is included in the training set, yielding an average AUC of 0.85, ranging from 0.68 to 0.96, as shown in Figure 10. Also, comparing the results in Section 5.2 and Section 5.3 with this section, the DAE model with the exemplar CNN method (Case B) trained with the small pseudo-labeled expanded dataset yields a higher AUC (0.96) than when a large pseudo-labeled dataset is added. The addition of large unlabeled data with pseudo-labels to the training set of the DAE does not contribute to superior performance in glaucoma detection. The DAE model does not effectively identify glaucoma cases, possibly due to overfitting or a lack of the relevant features required for accurate detection with the extensive pseudo-labeled expanded fundus data.
When comparing the performance metrics in Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13, for all three auto-encoders, training with pseudo-labeled expanded fundus images generated using the exemplar CNN (Case B) gives better glaucoma detection performance than training with those generated without the exemplar CNN (Case A). Combining two self-supervised learning methods (one with an AE model for the core architecture and another with the exemplar CNN technique for pseudo-label generation) establishes an effective strategy for enhancing diagnosis. This highlights the potential of these approaches in medical image analysis.

5.5. Core Architecture vs. Proposed Method vs. Existing Approaches: Performance Analysis

Comparative results of the core architecture (from Table 7) and the best performances of the proposed twin self-supervised models (from Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13 and Figure 8, Figure 9 and Figure 10) are presented in Table 14 and illustrated in Figure 11. When using a vanilla AE model with the proposed method (Table 8 and Table 9 and Figure 8), a small pseudo-labeled fundus image dataset produced by the VGG16 model (with the labeled fundus images $D_L$) and a large pseudo-labeled fundus image dataset produced by the Inception-v2 model (with the labeled fundus images $D_L$) show similar performance in glaucoma detection, except in sensitivity and specificity. Furthermore, both datasets (small and large pseudo-labeled expanded) outperform the core architecture while including the AE models (from Table 7, Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13), with the highest accuracy of 98% and AUC of 0.99. The proposed method recommends the CAE model due to its consistent performance in all cases. In some vanilla AE cases, MobileNet and ResNet-50 show 100% in all metrics. Although the 3-fold cross-validation results are consistent and show model stability, these exceptionally high performance metrics suggest the possibility of overfitting. When utilizing the proposed method with the CAE model (from Section 5.3, Table 10 and Table 11), the incorporation of a large pseudo-labeled fundus image dataset generated by Xception-v3 resulted in the highest accuracy of 98% and an AUC of 0.98. This outperformed the use of a small pseudo-labeled fundus image dataset produced by Xception-v3, which achieved an accuracy of 92% and an AUC of 0.93 in glaucoma detection.
When using the proposed method with the DAE model (from Section 5.4, Table 12 and Table 13), adding a small pseudo-labeled fundus image dataset generated by the MobileNet model to the labeled fundus image dataset $D_L$ resulted in significantly higher performance. This yielded an accuracy of 97% and an AUC of 0.96, significantly outperforming the large pseudo-labeled dataset generated by the Inception-v2 model, with an accuracy of 95% and an AUC of 0.94. From Table 14, the vanilla AE and the CAE also demonstrate improved performance compared to the core architecture, and both achieve superior performance in glaucoma detection, with an accuracy of 98% and an AUC of at least 0.97. Among the three AEs, using the proposed method of twin self-supervised learning with the CAE consistently resulted in improved performance (Section 5.2 and Section 5.4) across both small and large expanded datasets from all the pre-trained models when applied to the core architecture, except for VGG16 (Table 10 and Table 11 and Figure 11), in glaucoma detection. It can be inferred that the proposed method of using twin self-supervised learning with the CAE model surpasses the other AE models, showcasing its robustness and adaptability in glaucoma detection. Although the DAE initially showed the best performance among the core architectures in detecting glaucoma, it no longer had the highest metric values when the larger expanded dataset was used for better learning.
Based on the results and analysis so far, it is clear that using large pseudo-labels generated with the CAE-SVM combination yields better glaucoma prediction than smaller pseudo-label sets. Therefore, further analysis will focus on large pseudo-labels produced by CAE. Table 15 compares the proposed method with the state of the art that utilizes a transfer learning approach for small-sized labeled fundus images, labeled and unlabeled images in a semi-supervised learning approach, and a pseudo-label generation technique for glaucoma detection. In comparing the results of the proposed work with the other approaches, we found the following:
a. Transfer learning approach: Pre-trained models like [23,25,32] utilize a single database and show promising performance, achieving 89% accuracy in [25] and 100% accuracy in [23]. However, these models encounter challenges related to overfitting, generalizability, robustness, and adaptability. To address these issues, alternative strategies in the transfer learning approach include merging different databases with limited fundus images or increasing the number of images through data augmentation. For instance, [22] achieved the highest AUC of 0.94 after combining various databases, which is lower than the result of the proposed model. This discrepancy may stem from challenges related to compatibility, quality control, complexity, and overfitting risks. Another study has employed separate sets of databases for training and testing [26], which yielded an AUC of 0.90 due to domain shift or mismatched data distributions from the training data to the testing data. Due to data augmentation and the wide availability of high-resolution images in the G1020 dataset, the ResNet-50 model produces an AUC of 0.97 and an accuracy of 98% [27]. However, the model reduced the performance to 92.5% accuracy with the ORIGA dataset due to poor preprocessing results, which are lower than the proposed work. As suggested in the proposed work, overfitting problems caused by limited labeled data in transfer learning can be reduced by integrating pseudo-labeled expanded data into the training process, resulting in improved performance.
b. Semi-supervised learning: Some researchers have used an AE model to extract features from labeled or unlabeled fundus images in an unsupervised manner and then applied these features to a classification model for glaucoma detection. Among the semi-supervised approaches listed in Table 15, an accuracy of 90.8% is achieved in [28], and the highest AUC of 0.92 is attained in [29]. However, other metric values were not reported in these studies. These scores are lower than the results of the proposed method (which uses a two-way feature learning approach). In [30], a semi-supervised framework yields an accuracy of 95% and a precision of 96% using two-layer sparse AEs, performing better than the other semi-supervised models but not better than the proposed method. The proposed work differs from that of Diaz-Pinto et al. [31] in terms of the model and the training of the classification model with the expanded dataset. In [31], an SS-DCGAN model was originally intended for synthesizing images to produce large numbers of unlabeled fundus images, and the later stage of the process is the detection of glaucoma by training the discriminator part of the GAN. The SS-DCGAN model was evaluated for glaucoma detection and achieved an AUC of 0.90, sensitivity of 82.9%, and specificity of 79.8%. The proposed work surpasses the performance of this GAN model.
c. Pseudo-label generation approach: There are two methods for using test data after creating pseudo-labels. One method is to generate pseudo-labels for the entire dataset and then divide them into training and testing sets for further model training [48]. The other method is to split the data into separate sets for training and testing before pseudo-label generation. The proposed work has adopted the second method. The proposed work is directly comparable to this strategy, which generates the pseudo-label and further utilizes this pseudo-labeled data in only training the model for improved performance in glaucoma detection. Only two research works (as shown in Table 15) have been conducted based on the technique of pseudo-label generation for unlabeled fundus images in glaucoma detection, which are then used for training models and evaluating their performance. In [32], the DAE is initially trained with unlabeled fundus images for better feature learning. It employs an encoder with a softmax classifier trained with pseudo-labeled fundus images for glaucoma detection, which produces maximum performance metrics: accuracy of 93.8%, sensitivity of 98.9%, and specificity of 90.5%. However, this approach has relatively lower accuracy than the proposed method, and based on the specificity reported, it fails in classifying normal cases. This might be due to the relatively low number of unlabeled images used for feature learning of normal class, which may not provide enough diverse examples for the model to learn effectively, potentially leading to poor generalization and prediction accuracy. This prompted us to adopt a twin self-supervised strategy in the proposed work and further enhance the model with pseudo-label generation.
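The second strategy described in item c, holding out the test set before any pseudo-labels are generated, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `pseudo_labeler` is a hypothetical stand-in for the exemplar-CNN-based labeling stage, and the image identifiers, test fraction, and seed are invented.

```python
import random


def split_then_pseudo_label(labeled, unlabeled, pseudo_labeler,
                            test_fraction=0.2, seed=0):
    """Hold out a test set first, then pseudo-label the unlabeled images.

    labeled: list of (image_id, label) pairs with expert labels (D_L).
    unlabeled: list of image_id values without labels.
    pseudo_labeler: callable image_id -> predicted label (hypothetical).
    Returns (train_set, test_set) as lists of (image_id, label) pairs.
    """
    rng = random.Random(seed)
    shuffled = labeled[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test_set = shuffled[:n_test]
    train_labeled = shuffled[n_test:]
    # Pseudo-labels are generated only after the split, and only for
    # unlabeled images, so test images never enter the training set.
    pseudo_labeled = [(img, pseudo_labeler(img)) for img in unlabeled]
    train_set = train_labeled + pseudo_labeled
    return train_set, test_set


# Made-up identifiers; the lambda is a trivial placeholder labeler.
labeled = [(f"L{i}", i % 2) for i in range(10)]
unlabeled = [f"U{i}" for i in range(5)]
train_set, test_set = split_then_pseudo_label(
    labeled, unlabeled, pseudo_labeler=lambda img: 0)
print(len(train_set), len(test_set))
```

Keeping the split ahead of pseudo-label generation means the test images carry only expert labels and are used for evaluation alone, which matches the way the pseudo-labeled data are described as being used only for training.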
Among the three approaches discussed above, an AE model [32] is employed by Alghamdi and Abdel-Mottaleb for glaucoma detection, as indicated in the pseudo-label studies in Table 15, aligning with the proposed methodology. The SSCNN-DAE model achieved 93.8% accuracy, which is lower than the proposed work (overall accuracy of 98%) and even lower than the DAE model utilized in the proposed work, which yields an accuracy of 97%. The proposed work varies from this work in generating pseudo-labels and evaluating the model over varied pseudo-labeled dataset sizes. Another study, which also supports our work, by Fan et al. [33] utilized the self-learning strategy to generate pseudo-labels, combining them with labeled data to fine-tune a multi-task Siamese network for glaucoma detection. The final model achieved an accuracy of 90.2%, an AUC of 0.89, and an F1 score of 38.45%, lower values than the proposed work.
The model proposed in [33] struggled to accurately identify normal cases, likely due to many false positives, false negatives, or both, suggesting that it may misclassify glaucoma or normal cases. The proposed work utilizes the self-supervised discriminative feature learning exemplar CNN to enhance the prediction of pseudo-labels for unlabeled fundus images, expanding the training set for the semi-supervised AE model. The proposed method achieves an accuracy of 98%, recall of 98%, precision of 99%, sensitivity of 96%, specificity of 100%, and AUC of 0.98. The model showed minimal misclassification, identifying almost all glaucoma cases and missing only a few. Clinically, this means a low chance of missing true glaucoma cases and perfect specificity, with no healthy individuals misclassified. However, it is important to reiterate that the results presented above are based solely on internal evaluation (on the same-source dataset that was used for training the model), primarily involving the ORIGA and DRISHTI-GS datasets. During external evaluations, including leave-one-dataset-out evaluation or testing on an entirely unseen dataset, a decline in test performance has been observed, primarily attributable to variations in image characteristics across datasets. Furthermore, the external evaluations are presented with detailed results and discussion to assess the proposed model’s real-world applicability.
Nonetheless, the proposed twin self-supervised framework of combining self-supervised and semi-supervised (also called self-supervised learning) approaches surpasses existing state-of-the-art methods. The proposed framework combines a smaller labeled dataset with a large amount of unlabeled data, which provides considerable clinical benefits: it makes use of the available data and adapts quickly to clinical data without requiring much retraining with newly labeled data. With data from a specific hospital, the proposed work can capture the diverse characteristics of the local patient population. It is also robust and allows more unlabeled data to be included continuously. This approach enhances the accuracy of diagnosis and improves early predictions.
Table 16 examines the four stages of the proposed method, highlighting differences in final model complexity and efficiency. Stage 1 (Exemplar CNN—Xception-v3) is the most resource-intensive, with 189.07 M parameters, over 10,066 million FLOPs, and 1841.46 s of training, reflecting the computational cost of deep CNNs. Stage 2 (Supervised Classification) increases the parameters to 298.37 M, with similar FLOPs but a significantly shorter training time of 73.55 s, likely due to a more specialized task. Stage 3 (Unsupervised CAE) reduces the FLOPs to 983.23 M, with 135.37 M parameters and 492.76 s of training, indicating efficient learning. Stage 4 (CAE-SVM) has the smallest footprint, with just 0.24 M FLOPs, 71.44 s of training, and no trainable parameters. The dramatic reduction in FLOPs, to over 40,000 times fewer than in Stage 1, illustrates the advantage of combining unsupervised feature extraction with a lightweight SVM. Overall, moving from Stage 1 to Stage 4 shows a decline in computational complexity, with CAE-SVM being the most efficient and scalable, suitable for resource-constrained or real-time applications.
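The scale of the reduction quoted from Table 16 can be checked with simple arithmetic. The snippet below uses only the FLOP counts and training times stated in the paragraph above; Stage 1 is reported as "over 10,066 million" FLOPs, so the ratio is a lower bound, and the Stage 2 FLOP count is omitted because it is given only as "similar". The variable names and dictionary layout are illustrative.

```python
# FLOPs in millions, as quoted in the text (Stage 2 is not given numerically).
flops_millions = {
    "stage1_exemplar_cnn": 10066.0,
    "stage3_cae": 983.23,
    "stage4_cae_svm": 0.24,
}

# Training time in seconds for each of the four stages.
train_seconds = {
    "stage1": 1841.46,
    "stage2": 73.55,
    "stage3": 492.76,
    "stage4": 71.44,
}

ratio_1_to_4 = flops_millions["stage1_exemplar_cnn"] / flops_millions["stage4_cae_svm"]
ratio_3_to_4 = flops_millions["stage3_cae"] / flops_millions["stage4_cae_svm"]
total_train_seconds = sum(train_seconds.values())

print(f"Stage 1 / Stage 4 FLOPs: {ratio_1_to_4:,.0f}x")
print(f"Stage 3 / Stage 4 FLOPs: {ratio_3_to_4:,.0f}x")
print(f"Total training time: {total_train_seconds / 60:.1f} min")
```

The first ratio comes out just under 42,000, in line with the "over 40,000 times" figure, and the four training times sum to roughly 41 minutes, most of it spent in Stage 1.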
The experiment was performed on the CAE-SVM model (Stage 4 in Figure 2) by excluding either ORIGA or Drishti-GS, both part of the internal (same-source) dataset, from the training set for each glaucoma detection test. The experiment was then repeated with the other dataset excluded. When Drishti-GS was removed from the final training set (which then included only ORIGA and $D_{PL}$) and used solely as the test set, the model achieved 97% accuracy, 97% recall, 97% precision, an AUC of 1.00, 100% sensitivity, and 94% specificity. Similarly, excluding ORIGA from the training set (which then contained Drishti-GS plus $D_{PL}$) and using ORIGA as the test set yielded an accuracy of 96%, a recall of 91%, a precision of 97%, an AUC of 0.99, a sensitivity of 100%, and a specificity of 94%. However, it is important to reiterate that these are internal evaluations (based on the same-source dataset), and therefore this strong performance could be influenced by features learned in earlier stages (Stage 1 and Stage 3 in Figure 2), since both training and testing sets originate from the same-source dataset. Although these images are excluded from CAE-SVM training, their source similarity may impact the results. To validate the model’s high performance and evaluate its robustness and adaptability, the experiment was additionally repeated using a leave-one-dataset-out protocol. In this protocol, the CRFO-v4 dataset was entirely removed from the large set $L_1$ of pseudo-label generation datasets, so it is no longer used for training in unsupervised learning, glaucoma detection, or pseudo-label generation. Pseudo-labels are generated without it, and CRFO-v4 serves only as a test dataset to evaluate the final model.
Under the leave-one-dataset-out protocol, with CRFO-v4 excluded from all training stages, the model achieved an accuracy of 84%, a recall of 83%, a precision of 84%, an AUC of 0.91, a sensitivity of 81%, and a specificity of 85% on CRFO-v4, showing that it generalizes reasonably well to a dataset withheld from training. Additionally, the model was tested on one more independent dataset, the PAPILA dataset [49], for cross-dataset validation. On this fully unseen external dataset, the model achieved an accuracy of 85%, a recall of 66%, a precision of 84%, and an AUC of 0.87, with minor adaptations. Despite the high precision and overall accuracy, the lower recall indicates that the model misses some glaucoma cases in entirely new domains, leaving room for improvement in sensitivity across datasets. Figure 12 and Figure 13 show the confusion matrices and ROC curves for the CRFO-v4 and PAPILA datasets.
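For reference, the reported metrics can all be derived from the four counts of a binary confusion matrix. The sketch below shows the standard definitions, treating glaucoma as the positive class. The counts are hypothetical and are not read from Figure 12. The `macro_recall` entry reflects an assumption that the separately reported recall is averaged over both classes, whereas sensitivity refers to the glaucoma class only.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute screening metrics from binary confusion-matrix counts,
    with glaucoma as the positive class."""
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)   # recall of the glaucoma class
    specificity = tn / (tn + fp)   # recall of the normal class
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Assumed reading of the reported recall: the mean of both class recalls.
        "macro_recall": (sensitivity + specificity) / 2,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts for illustration only, not taken from the paper.
metrics = binary_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A model can therefore combine high precision with low sensitivity when it raises few false alarms but misses many true cases.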
These experiments indicate that, although training on a large pseudo-labeled dataset was intended to enhance the generalization of the CAE-SVM model (Stage 4 in Figure 2), performance declined by at least 8% across various metrics in cross-dataset evaluation. This decline is attributable to domain shifts, such as differences in imaging protocols and patient demographics, that pseudo-labeling did not fully address. Similar performance drops in cross-dataset scenarios have been reported in previous studies [50,51], highlighting the challenge of generalizing across heterogeneous datasets. Nevertheless, the model remains competitive, suggesting that pseudo-labeling at a larger scale contributes positively to generalization despite domain variability.

6. Conclusions

The proposed work presents a computer-aided diagnosis framework based on twin self-supervised learning. The approach combines self-supervised learning using ImageNet pre-trained models with semi-supervised learning using an auto-encoder (AE), itself a form of self-supervised learning, to address the scarcity of labeled data and the abundance of unlabeled fundus images in glaucoma detection. The self-supervised exemplar CNN generates pseudo-labels for the unlabeled data, which are then merged with the limited labeled data to improve a second self-supervised model, the CAE-SVM combination, for glaucoma classification. This twin self-supervised learning framework surpasses the state of the art, achieving all metrics above 95% on the internal test data when training and testing data come from the same source, and achieves competitive results on external datasets, whether held out (leave-one-dataset-out) or fully unseen. Diverse datasets were used for the proposed pseudo-label generation technique, and the model was trained using both small and large sets of pseudo-labeled expanded fundus images. Evaluation of the model showed that pseudo-label generation and the integration of widely available unlabeled data significantly enhance model performance. The encouraging results obtained when training on the large dataset $L_1$ compared with the small dataset $S_1$ highlight the method's effectiveness, robustness, and adaptability, demonstrating its potential to tackle data scarcity in medical image analysis. A limitation of the proposed work is that the fundus image datasets are class-imbalanced, mirroring the natural clinical occurrence of normal and glaucomatous cases. Future research could investigate strategies to address this class imbalance within semi-supervised learning frameworks.
Furthermore, adversarial learning methods that simultaneously refine the pseudo-label generator and the final classifier could decrease label noise and enhance overall consistency, presenting another promising avenue for future research. The proposed framework enables accurate glaucoma detection with minimal labeled data, thereby reducing the clinician's burden. It can also facilitate early diagnosis and large-scale screening in resource-limited settings, including rural and low-income areas, ultimately helping to prevent vision loss and improve public eye health.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/asi8040111/s1, Table S1: Details of the unsupervised learning dataset (Reconstruction of images); Table S2: Details of the pseudo-labeling dataset (Prediction of pseudo-labels); Table S3: Details of the supervised learning dataset (Glaucoma detection); Table S4: Details of the training set for glaucoma detection via supervised learning; Table S5: Details of the testing set for glaucoma detection; Table S6: Summary of Training and Evaluation Setups for the Proposed Method.

Author Contributions

The authors contributed to the paper as follows. Conceptualization: R.G.J.B. and S.G.; Methodology: R.G.J.B. and S.G.; Formal analysis and investigation: R.G.J.B. and S.G.; Writing—original manuscript preparation: S.G.; Writing—review and editing: R.G.J.B. and S.G.; Supervision: R.G.J.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No new datasets were generated during this study. Publicly available datasets were used as part of the analysis. Relevant scripts and code for reproducing the experimental setup can be made available upon reasonable request to interested researchers.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Stein, J.D.; Khawaja, A.P.; Weizer, J.S. Glaucoma in Adults-Screening, Diagnosis, and Management: A Review. JAMA 2021, 325, 164–174.
  2. Hagiwara, Y.; Koh, J.E.W.; Tan, J.H.; Bhandary, S.V.; Laude, A.; Ciaccio, E.J.; Tong, L.; Acharya, U.R. Computer-Aided Diagnosis of Glaucoma Using Fundus Images: A Review. Comput. Methods Programs Biomed. 2018, 165, 1–12.
  3. Sengupta, S.; Singh, A.; Leopold, H.A.; Gulati, T.; Lakshminarayanan, V. Ophthalmic Diagnosis Using Deep Learning with Fundus Images – A Critical Review. Artif. Intell. Med. 2020, 102, 101758.
  4. Bindu, A.; Thangavel, S.K.; Somasundaram, K.; Parthasaradhi, S.; Pulgurthi, R.G.; Dhar, M.Y. Beyond the Black Box: Explainable AI for Glaucoma Detection and Future Improvements. In Proceedings of the 2024 15th International Conference on Computing Communication and Networking Technologies, Kamand, India, 24–28 June 2024; pp. 1–9.
  5. Rajanbabu, K.; Veetil, I.K.; Sowmya, V.; Gopalakrishnan, E.A.; Soman, K.P. Ensemble of Deep Transfer Learning Models for Parkinson’s Disease Classification. In Soft Computing and Signal Processing; Springer: Berlin/Heidelberg, Germany, 2022; pp. 135–143.
  6. Thompson, A.C.; Jammal, A.A.; Medeiros, F.A. A Review of Deep Learning for Screening, Diagnosis, and Detection of Glaucoma Progression. Transl. Vis. Sci. Technol. 2020, 9, 42.
  7. Zedan, M.J.M.; Zulkifley, M.A.; Ibrahim, A.A.; Moubark, A.M.; Kamari, N.A.M.; Abdani, S.R. Automated Glaucoma Screening and Diagnosis Based on Retinal Fundus Images Using Deep Learning Approaches: A Comprehensive Review. Diagnostics 2023, 13, 2180.
  8. Zhou, W.; Gao, Y.; Ji, J.; Li, S.; Yi, Y. Unsupervised Anomaly Detection for Glaucoma Diagnosis. Wirel. Commun. Mob. Comput. 2021, 2021, 5978495.
  9. Raza, K.; Singh, N.K. A Tour of Unsupervised Deep Learning for Medical Image Analysis. Curr. Med. Imaging 2021, 17, 1059–1077.
  10. Mo, J.; Gan, Y.; Yuan, H. Weighted Pseudo Labeled Data and Mutual Learning for Semi-Supervised Classification. IEEE Access 2021, 9, 36522–36534.
  11. Seydgar, M.; Rahnamayan, S.; Ghamisi, P.; Bidgoli, A.A. Semisupervised Hyperspectral Image Classification Using a Probabilistic Pseudo-Label Generation Framework. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5535218.
  12. Al Ghamdi, M.; Li, M.; Abdel-Mottaleb, M.; Shousha, M.A. Semi-Supervised Transfer Learning for Convolutional Neural Networks for Glaucoma Detection. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings 2019, Brighton, UK, 12–17 May 2019; pp. 3812–3816.
  13. Cui, Y.; Ji, X.; Xu, K.; Wang, L. A Double-Strategy-Check Active Learning Algorithm for Hyperspectral Image Classification. Photogramm. Eng. Remote Sens. 2019, 85, 841–851.
  14. Sun, B.; Kang, X.; Li, S.; Benediktsson, J.A. Random-Walker-Based Collaborative Learning for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 212–222.
  15. Zhang, Y.; Cao, G.; Li, X.; Wang, B.; Fu, P. Active Semi-Supervised Random Forest for Hyperspectral Image Classification. Remote Sens. 2019, 11, 2974.
  16. Xu, X.; Liao, J.; Cai, L.; Nguyen, M.C.; Lu, K.; Zhang, W.; Yazici, Y.; Foo, C.S. Revisiting Pretraining for Semi-Supervised Learning in the Low-Label Regime. Neurocomputing 2024, 565, 126971.
  17. Krishnan, R.; Rajpurkar, P.; Topol, E.J. Self-Supervised Learning in Medicine and Healthcare. Nat. Biomed. Eng. 2022, 6, 1346–1352.
  18. Yang, X.; Song, Z.; King, I.; Xu, Z. A Survey on Deep Semi-Supervised Learning. IEEE Trans. Knowl. Data Eng. 2023, 35, 8934–8954.
  19. Mao, J.; Yin, X.; Zhang, G.; Chen, B.; Chang, Y.; Chen, W.; Yu, J.; Wang, Y. Pseudo-Labeling Generative Adversarial Networks for Medical Image Classification. Comput. Biol. Med. 2022, 147, 105729.
  20. Shurrab, S.; Duwairi, R. Self-Supervised Learning Methods and Applications in Medical Imaging Analysis: A Survey. PeerJ Comput. Sci. 2022, 8, e1045.
  21. Krishnan, S.; Amudha, J.; Tejwani, S. Gaze Fusion-Deep Neural Network Model for Glaucoma Detection. Commun. Comput. Inf. Sci. 2021, 1366, 42–53.
  22. Gómez-Valverde, J.J.; Antón, A.; Fatti, G.; Liefers, B.; Herranz, A.; Santos, A.; Sánchez, C.I.; Ledesma-Carbayo, M.J. Automatic Glaucoma Classification Using Color Fundus Images Based on Convolutional Neural Networks and Transfer Learning. Biomed. Opt. Express 2019, 10, 892.
  23. Sushil, M.; Suguna, G.; Lavanya, R.; Nirmala Devi, M. Performance Comparison of Pre-Trained Deep Neural Networks for Automated Glaucoma Detection. In Lecture Notes in Computational Vision and Biomechanics; Springer: Berlin/Heidelberg, Germany, 2019; Volume 30, pp. 631–637.
  24. Latif, J.; Tu, S.; Xiao, C.; Rehman, S.U.; Sadiq, M.; Farhan, M. Digital Forensics Use Case for Glaucoma Detection Using Transfer Learning Based on Deep Convolutional Neural Networks. Secur. Commun. Netw. 2021, 2021, 4494447.
  25. Suguna, G.; Lavanya, R. Performance Assessment of EyeNet Model in Glaucoma Diagnosis. Pattern Recognit. Image Anal. 2021, 31, 334–344.
  26. Singh, L.K.; Pooja; Garg, H.; Khanna, M. Deep Learning System Applicability for Rapid Glaucoma Prediction from Fundus Images across Various Data Sets. Evol. Syst. 2022, 13, 807–836.
  27. Shoukat, A.; Akbar, S.; Hassan, S.A.; Iqbal, S.; Mehmood, A.; Ilyas, Q.M. Automatic Diagnosis of Glaucoma from Retinal Images Using Deep Learning Approach. Diagnostics 2023, 13, 1738.
  28. Bechar, M.E.A.; Settouti, N.; Barra, V.; Chikh, M.A. Semi-Supervised Superpixel Classification for Medical Images Segmentation. Multidimens. Syst. Signal Process 2018, 29, 979–998.
  29. Pal, A.; Moorthy, M.R.; Shahina, A. G-Eyenet: A Convolutional Autoencoding Classifier Framework for the Detection of Glaucoma from Retinal Fundus Images. In Proceedings of the 2018 25th International Conference on Image Processing, ICIP, Athens, Greece, 7–10 October 2018; pp. 2775–2779.
  30. Raghavendra, U.; Gudigar, A.; Bhandary, S.V.; Rao, T.N.; Ciaccio, E.J.; Acharya, U.R. A Two Layer Sparse Autoencoder for Glaucoma Identification with Fundus Images. J. Med. Syst. 2019, 43, 299.
  31. Diaz-Pinto, A.; Colomer, A.; Naranjo, V.; Morales, S.; Xu, Y.; Frangi, A.F. Retinal Image Synthesis and Semi-Supervised Learning for Glaucoma Assessment. IEEE Trans. Med. Imaging 2019, 38, 2211–2218.
  32. Alghamdi, M.; Abdel-Mottaleb, M. A Comparative Study of Deep Learning Models for Diagnosing Glaucoma from Fundus Images. IEEE Access 2021, 9, 23894–23906.
  33. Fan, R.; Bowd, C.; Brye, N.; Christopher, M.; Weinreb, R.N.; Kriegman, D.J.; Zangwill, L.M. One-Vote Veto: Semi-Supervised Learning for Low-Shot Glaucoma Diagnosis. IEEE Trans. Med. Imaging 2023, 42, 3764–3778.
  34. Camara, J.; Rezende, R.; Pires, I.M.; Cunha, A. Retinal Glaucoma Public Datasets: What Do We Have and What Is Missing? J. Clin. Med. 2022, 11, 3850.
  35. Sivaswamy, J.; Krishnadas, S.R.; Joshi, G.D.; Ujjwal, M.J.; Tabish, S. Drishti-GS: Retinal Image Dataset for Optic Nerve Head(ONH) Segmentation. In Proceedings of the 2014 IEEE 11th International Symposium on Biomedical Imaging, ISBI 2014, Beijing, China, 29 April–2 May 2014; pp. 53–56.
  36. Zhang, Z.; Yin, F.S.; Liu, J.; Wong, W.K.; Tan, N.M.; Lee, B.H.; Cheng, J.; Wong, T.Y. ORIGA-Light: An Online Retinal Fundus Image Database for Glaucoma Analysis and Research. In Proceedings of the 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology 2010, Buenos Aires, Argentina, 31 August–4 September 2010; pp. 3065–3068.
  37. Budai, A.; Bock, R.; Maier, A.; Hornegger, J.; Michelson, G. Robust Vessel Segmentation in Fundus Images. Int. J. Biomed. Imaging 2013, 2013, 154860.
  38. De Vente, C.; Vermeer, K.A.; Jaccard, N.; Wang, H.; Sun, H.; Khader, F.; Truhn, D.; Aimyshev, T.; Zhanibekuly, Y.; Le, T.D.; et al. AIROGS: Artificial Intelligence for RObust Glaucoma Screening Challenge. IEEE Trans. Med. Imaging 2023, 43, 542–557.
  39. Jin, K.; Huang, X.; Zhou, J.; Li, Y.; Yan, Y.; Sun, Y.; Zhang, Q.; Wang, Y.; Ye, J. FIVES: A Fundus Image Dataset for Artificial Intelligence Based Vessel Segmentation. Sci. Data 2022, 9, 475.
  40. Jin, K.; Gao, Z.; Jiang, X.; Wang, Y.; Ma, X.; Li, Y.; Ye, J. MSHF: A Multi-Source Heterogeneous Fundus (MSHF) Dataset for Image Quality Assessment. Sci. Data 2023, 10, 286.
  41. Orlando, J.I.; Barbosa Breda, J.; van Keer, K.; Blaschko, M.B.; Blanco, P.J.; Bulant, C.A. Towards a Glaucoma Risk Index Based on Simulated Hemodynamics from Fundus Images. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2018; Volume 11071, pp. 65–73.
  42. Bajwa, M.N.; Singh, G.A.P.; Neumeier, W.; Malik, M.I.; Dengel, A.; Ahmed, S. G1020: A Benchmark Retinal Fundus Image Dataset for Computer-Aided Glaucoma Detection. In Proceedings of the International Joint Conference on Neural Networks 2020, Glasgow, UK, 19–24 July 2020.
  43. Hassan, T.; Akram, M.U.; Nazir, M.N. A Composite Retinal Fundus and OCT Dataset with Detailed Clinical Markings of Retinal Layers and Retinal Lesions to Grade Macular and Glaucomatous Disorders. In Proceedings of the 2022 2nd International Conference on Digital Futures and Transformative Technologies (ICoDT2), Rawalpindi, Pakistan, 24–26 May 2022; Volume 4.
  44. Chen, M.; Shi, X.; Zhang, Y.; Wu, D.; Guizani, M. Deep Feature Learning for Medical Image Analysis with Convolutional Autoencoder Neural Network. IEEE Trans. Big Data 2017, 7, 750–758.
  45. Keerthana, D.; Venugopal, V.; Nath, M.K.; Mishra, M. Hybrid Convolutional Neural Networks with SVM Classifier for Classification of Skin Cancer. Biomed. Eng. Adv. 2023, 5, 100069.
  46. Wu, H.; Huang, Q.; Wang, D.; Gao, L. A CNN-SVM combined model for pattern recognition of knee motion using mechanomyography signals. J. Electromyogr. Kinesiol. 2018, 42, 136–142.
  47. Dosovitskiy, A.; Fischer, P.; Springenberg, J.T.; Riedmiller, M.; Brox, T. Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1734–1747.
  48. Santos, M.S.; Soares, J.P.; Abreu, P.H.; Araujo, H.; Santos, J. Cross-validation for imbalanced datasets: Avoiding overoptimistic and overfitting approaches [Research Frontier]. IEEE Comput. Intell. Mag. 2018, 13, 59–76.
  49. Kovalyk, O.; Morales-Sánchez, J.; Verdú-Monedero, R.; Sellés-Navarro, I.; Palazón-Cabanes, A.; Sancho-Gómez, J.L. PAPILA: Dataset with fundus images and clinical data of both eyes of the same patient for glaucoma assessment. Sci. Data 2022, 9, 291.
  50. Remyes, D.; Nasef, D.; Remyes, S.; Tawfellos, J.; Sher, M.; Nasef, D.; Toma, M. Clinical Applicability and Cross-Dataset Validation of Machine Learning Models for Binary Glaucoma Detection. Information 2025, 16, 432.
  51. Chowdhury, R.P.; Karkera, N.S. Early Glaucoma Detection using Deep Learning with Multiple Datasets of Fundus Images. arXiv 2025, arXiv:2506.21770.
Figure 1. Sample fundus images [35,36]. Drishti-GS dataset: (a) fundus image of a normal case, (b) fundus image of a glaucoma case. ORIGA dataset: (c) fundus image of a normal case, (d) fundus image of a glaucoma case.
Figure 2. Proposed twin self-supervised learning method for effectively detecting glaucoma using fundus images via pseudo-label generation.
Figure 3. Self-supervised learning framework for glaucoma detection in the proposed method.
Figure 4. Pseudo-label generation framework for unlabeled fundus images in glaucoma detection using ImageNet model trained on self-supervision 1.
Figure 5. Proposed model based on self-supervision 2 with expanded fundus dataset via pseudo-labeled set created from self-supervision 1 for glaucoma detection.
Figure 6. Core architecture used in the proposed method for glaucoma classification.
Figure 7. Training and validation loss (mean square error) plot of different auto-encoders.
Figure 8. ROC plots of vanilla AE across various analysis cases.
Figure 9. ROC plots of CAE across various analysis cases.
Figure 10. ROC plots of DAE across various analysis cases.
Figure 11. Performance comparison of core architecture and proposed method with small and large pseudo-labeled expanded fundus datasets.
Figure 12. Confusion matrices of the CAE-SVM model for the external (leave-one-dataset-out) CRFO-v4 dataset and the external (fully unseen) PAPILA dataset in glaucoma detection.
Figure 13. ROC-AUC plots of the CAE-SVM model for the external (leave-one-dataset-out) CRFO-v4 dataset and the external (fully unseen) PAPILA dataset in glaucoma detection.
Table 1. Research based on transfer learning using labeled fundus images.

| Model | Database | No. of Images (N) | No. of Images (G) | Method |
|---|---|---|---|---|
| VGG19 [22] | ESPERANZA | 1333 | 113 | Transfer learning |
| | RIM-ONE r1 | 118 | 40 | |
| | RIM-ONE r2 | 255 | 200 | |
| | RIM-ONE r3 | 85 | 74 | |
| | Drishti-GS | 31 | 70 | |
| VGG16 [23] | HRF | 15 | 15 | Fine-tuning |
| Inception-v3 [24] | Private | 1184 | 518 | Transfer learning |
| EyeNet [25] | RIM-ONE r2 | 255 | 200 | Transfer learning |
| Inception-ResNet-V2 [26] | ORIGA | 482 | 168 | Transfer learning and Fine-tuning |
| | ACRIMA | 309 | 396 | |
| | HRF | 15 | 15 | |
| | Drishti-GS | 31 | 70 | |
| | Private | 20 | 13 | |
| ResNet-50 [27] | G1020 | 724 | 296 | Data augmentation and Transfer learning |
| | Drishti-GS | 31 | 70 | |
| | ORIGA | 482 | 168 | |
| | RIM-ONE | 118 | 41 | |

N indicates normal fundus image, and G indicates fundus image having glaucoma.
Table 2. Research based on semi-supervised learning using unlabeled fundus image datasets.

| Model | Database | No. of Images (Unlabeled) | No. of Images, Labeled (N) | No. of Images, Labeled (G) | Method |
|---|---|---|---|---|---|
| Super-pixel architecture [28] | RIM-ONE r3 | - | 85 | 74 | Co-forest algorithm |
| Multi model Network G-EyeNet [29] | HRF | - | 15 | 15 | Preprocessing and encoder-decoder CNN training |
| | Drishti-GS | | 31 | 70 | |
| | RIM-ONE v3 | | 85 | 74 | |
| | DRIONS-DB | | 50 | 60 | |
| Cascaded sparse auto-encoder [30] | Private | - | 294 | 418 | Feature extraction with Machine Learning (ML) classification |
| GAN [31] | ORIGA | | 482 | 168 | Image synthesizing and Feature Extractor |
| | Drishti-GS | | 31 | 70 | |
| | RIM-ONE | | 261 | 194 | |
| | sjchoi86-HRF | | 300 | 101 | |
| | HRF | | 18 | 27 | |
| | ACRIMA | | 309 | 396 | |
| | Collection of 9 databases | 84,569 | - | - | |

N indicates normal fundus image, and G indicates fundus image having glaucoma.
Table 3. Research based on pseudo-label generation techniques.

| Model | Database | No. of Images (Unlabeled) | No. of Images, Labeled (N) | No. of Images, Labeled (G) | Method |
|---|---|---|---|---|---|
| SSCNN-DAE [32] | RIM-ONE | | 255 | 200 | Unsupervised learning and Transfer learning |
| | RIGA | 750 | | | |
| Siamese network [33] | ACRIMA | | 309 | 396 | Multi-Task Siamese Network (MTSN) + One Vote Veto (OVV) |
| | LAG | | 3143 | 1711 | |
| | OHTS | | 71,176 | 3502 | |
| | DIGS/ADAGES | | 5184 | 4289 | |

N indicates normal fundus image, and G indicates fundus image having glaucoma.
Table 4. Merged fundus image dataset $D_{fundus}$ considered for the proposed work of glaucoma diagnosis.

| Ref | Dataset | N | G | Total |
|---|---|---|---|---|
| [35] | Drishti-GS | 31 | 70 | 101 |
| [36] | ORIGA | 482 | 168 | 650 |
| | Total | 513 | 238 | 751 |

N indicates normal fundus image, and G indicates fundus image having glaucoma.
Table 5. Dataset $L_1$, the large fundus dataset considered for the proposed work.

| Ref | Dataset | N | G | Total |
|---|---|---|---|---|
| [37] | HRF | 15 | 15 | 30 |
| [38] | EyePACS-AIROGS | 3270 | 3270 | 6540 |
| [39] | FIVES | 200 | 200 | 400 |
| [40] | MSHF | 26 | 52 | 78 |
| [41] | LES-AV | 11 | 11 | 22 |
| [42] | G1020 | 724 | 296 | 1020 |
| [43] | CRFO-v4 | 31 | 48 | 79 |
| | Total | 4277 | 3892 | 8169 |

N indicates normal fundus image, and G indicates fundus image having glaucoma.
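Table 6. SVM classifier hyperparameters while training with expanded fundus image set $E_{labeled}$ in the proposed work. Blank kernel cells are left as in the source.

| Model | Pre-Trained Model | Small set: C | Small set: Degree | Small set: Gamma | Small set: Kernel | Large set: C | Large set: Degree | Large set: Gamma | Large set: Kernel |
|---|---|---|---|---|---|---|---|---|---|
| Vanilla AE | VGG16 | 1 | 2 | 0.1 | rbf | 0.1 | 8 | 0.001 | rbf |
| | ResNet-50 | 0.1 | 7 | 0.01 | | 0.01 | 7 | 0.002 | linear |
| | MobileNet | 0.1 | 2 | 0.1 | | 1 | 2 | 0.1 | rbf |
| | Inception-v2 | 0.1 | 8 | 0.01 | | 1 | 8 | 0.001 | |
| | Xception-v3 | 0.1 | 2 | 0.01 | | 0.1 | 2 | 0.1 | |
| | DenseNet101 | 1 | 2 | 0.001 | | 1 | 2 | 0.1 | |
| CAE | VGG16 | 0.1 | 8 | 0.001 | rbf | 1 | 2 | 0.1 | rbf |
| | ResNet-50 | 0.1 | 7 | 0.002 | | 0.1 | 7 | 0.01 | |
| | MobileNet | 1 | 2 | 0.1 | | 0.1 | 2 | 0.1 | |
| | Inception-v2 | 0.1 | 8 | 0.001 | | 0.1 | 8 | 0.01 | |
| | Xception-v3 | 1 | 2 | 0.1 | | 0.1 | 2 | 0.01 | |
| | DenseNet101 | 1 | 2 | 0.1 | | 1 | 2 | 0.001 | |
| DAE | VGG16 | 1 | 2 | 0.01 | rbf | 1 | 2 | 0.01 | rbf |
| | ResNet-50 | 1 | 7 | 0.01 | | 1 | 7 | 0.01 | |
| | MobileNet | 0.1 | 2 | 0.001 | | 0.1 | 2 | 0.001 | |
| | Inception-v2 | 0.1 | 8 | 0.001 | | 0.1 | 8 | 0.001 | |
| | Xception-v3 | 0.1 | 2 | 0.001 | | 0.1 | 2 | 0.0001 | |
| | DenseNet101 | 0.1 | 7 | 0.001 | | 0.1 | 7 | 0.001 | |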
Table 7. Results comparison of the core architecture in the proposed method with different auto-encoders.

| Auto-Encoder in the Core Architecture | Acc (%) | F1 (%) | Prec (%) | Recall (%) | Sen (%) | Spec (%) | AUC |
|---|---|---|---|---|---|---|---|
| Vanilla AE | 80 | 84 | 80 | 83 | 90 | 71 | 0.84 |
| CAE | 85 | 88 | 85 | 86 | 71 | 100 | 0.88 |
| DAE | 93 | 95 | 93 | 93 | 85 | 100 | 0.94 |
Table 8. Performance metrics comparison for vanilla AE using a small pseudo-labeled (obtained from various pre-trained models) expanded dataset without (Case A) and with exemplar CNN (Case B).

| Model | A: Acc (%) | A: F1 (%) | A: Prec (%) | A: Recall (%) | A: Sen (%) | A: Spec (%) | A: AUC | B: Acc (%) | B: F1 (%) | B: Prec (%) | B: Recall (%) | B: Sen (%) | B: Spec (%) | B: AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VGG16 | 93 | 93 | 93 | 94 | 96 | 92 | 0.94 | 98 | 98 | 98 | 99 | 100 | 97 | 0.99 |
| ResNet-50 | 72 | 62 | 85 | 63 | 26 | 100 | 0.63 | 93 | 93 | 95 | 91 | 83 | 100 | 0.91 |
| MobileNet | 72 | 62 | 85 | 63 | 26 | 100 | 0.63 | 100 | 100 | 100 | 100 | 100 | 100 | 1.00 |
| Inception-v2 | 75 | 68 | 86 | 67 | 35 | 100 | 0.67 | 92 | 91 | 94 | 89 | 78 | 100 | 0.89 |
| Xception-v3 | 93 | 93 | 93 | 94 | 96 | 92 | 0.94 | 97 | 97 | 97 | 97 | 96 | 97 | 0.97 |
| DenseNet-101 | 85 | 83 | 90 | 80 | 61 | 100 | 0.80 | 98 | 98 | 98 | 99 | 100 | 97 | 0.99 |
Table 9. Performance metrics comparison for vanilla AE using a large pseudo-labeled (obtained from various pre-trained models) expanded dataset without (Case A) and with exemplar CNN (Case B).

| Model | A: Acc (%) | A: F1 (%) | A: Prec (%) | A: Recall (%) | A: Sen (%) | A: Spec (%) | A: AUC | B: Acc (%) | B: F1 (%) | B: Prec (%) | B: Recall (%) | B: Sen (%) | B: Spec (%) | B: AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VGG16 | 95 | 95 | 94 | 96 | 100 | 92 | 0.96 | 83 | 83 | 85 | 85 | 100 | 74 | 0.87 |
| ResNet-50 | 100 | 100 | 100 | 100 | 100 | 100 | 1.00 | 100 | 100 | 100 | 100 | 100 | 100 | 1.00 |
| MobileNet | 90 | 89 | 90 | 89 | 83 | 95 | 0.89 | 90 | 90 | 90 | 90 | 87 | 92 | 0.90 |
| Inception-v2 | 95 | 95 | 96 | 93 | 87 | 100 | 0.93 | 98 | 98 | 99 | 98 | 96 | 100 | 0.98 |
| Xception-v3 | 85 | 85 | 86 | 88 | 80 | 76 | 0.88 | 82 | 78 | 89 | 76 | 52 | 100 | 0.76 |
| DenseNet-101 | 70 | 70 | 78 | 76 | 100 | 53 | 0.76 | 59 | 59 | 74 | 67 | 100 | 34 | 0.67 |
Table 10. Performance metrics comparison of different pre-trained models for CAE using a small pseudo-labeled enhanced dataset without (Case A) and with exemplar CNN (Case B).

| Model | A: Acc (%) | A: F1 (%) | A: Prec (%) | A: Recall (%) | A: Sen (%) | A: Spec (%) | A: AUC | B: Acc (%) | B: F1 (%) | B: Prec (%) | B: Recall (%) | B: Sen (%) | B: Spec (%) | B: AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VGG16 | 93 | 93 | 93 | 95 | 100 | 89 | 0.95 | 92 | 92 | 91 | 93 | 100 | 87 | 0.93 |
| ResNet-50 | 83 | 81 | 85 | 80 | 65 | 95 | 0.80 | 95 | 95 | 94 | 96 | 100 | 92 | 0.96 |
| MobileNet | 93 | 93 | 93 | 93 | 91 | 95 | 0.93 | 93 | 93 | 93 | 95 | 100 | 89 | 0.95 |
| Inception-v2 | 85 | 83 | 86 | 82 | 70 | 95 | 0.82 | 95 | 95 | 94 | 96 | 100 | 92 | 0.96 |
| Xception-v3 | 78 | 75 | 81 | 73 | 52 | 95 | 0.73 | 92 | 92 | 91 | 93 | 100 | 87 | 0.93 |
| DenseNet-101 | 80 | 80 | 83 | 84 | 100 | 68 | 0.84 | 87 | 87 | 87 | 89 | 100 | 79 | 0.89 |
Table 11. Performance metrics comparison of different pre-trained models for CAE using a large pseudo-labeled enhanced dataset without (Case A) and with exemplar CNN (Case B).

| Model | A: Acc (%) | A: F1 (%) | A: Prec (%) | A: Recall (%) | A: Sen (%) | A: Spec (%) | A: AUC | B: Acc (%) | B: F1 (%) | B: Prec (%) | B: Recall (%) | B: Sen (%) | B: Spec (%) | B: AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VGG16 | 95 | 95 | 94 | 95 | 96 | 95 | 0.95 | 93 | 93 | 93 | 94 | 96 | 92 | 0.94 |
| ResNet-50 | 91 | 92 | 91 | 93 | 100 | 87 | 0.93 | 97 | 97 | 96 | 97 | 100 | 95 | 0.97 |
| MobileNet | 80 | 80 | 83 | 84 | 100 | 68 | 0.84 | 90 | 90 | 89 | 91 | 96 | 87 | 0.91 |
| Inception-v2 | 95 | 95 | 94 | 96 | 100 | 92 | 0.96 | 95 | 95 | 94 | 96 | 100 | 92 | 0.96 |
| Xception-v3 | 93 | 93 | 93 | 95 | 100 | 89 | 0.95 | 98 | 98 | 99 | 98 | 96 | 100 | 0.98 |
| DenseNet101 | 77 | 77 | 81 | 82 | 100 | 63 | 0.82 | 85 | 85 | 86 | 88 | 100 | 76 | 0.88 |
Table 12. Performance metrics comparison of different pre-trained models for DAE using a small pseudo-labeled enhanced dataset without (Case A) and with exemplar CNN (Case B).

| Model | A: Acc (%) | A: F1 (%) | A: Prec (%) | A: Recall (%) | A: Sen (%) | A: Spec (%) | A: AUC | B: Acc (%) | B: F1 (%) | B: Prec (%) | B: Recall (%) | B: Sen (%) | B: Spec (%) | B: AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VGG16 | 77 | 70 | 87 | 70 | 39 | 100 | 0.70 | 62 | 62 | 75 | 70 | 100 | 39 | 0.70 |
| ResNet-50 | 75 | 68 | 86 | 67 | 35 | 67 | 0.67 | 85 | 83 | 90 | 90 | 61 | 100 | 0.80 |
| MobileNet | 95 | 95 | 96 | 93 | 87 | 93 | 0.93 | 97 | 96 | 97 | 96 | 91 | 100 | 0.96 |
| Inception-v2 | 87 | 85 | 91 | 83 | 65 | 83 | 0.83 | 89 | 87 | 92 | 85 | 70 | 100 | 0.85 |
| Xception-v3 | 95 | 95 | 96 | 93 | 87 | 93 | 0.93 | 95 | 95 | 96 | 93 | 87 | 100 | 0.93 |
| DenseNet101 | 92 | 91 | 91 | 92 | 91 | 92 | 0.92 | 92 | 96 | 97 | 96 | 91 | 100 | 0.96 |
Table 13. Performance metrics comparison of different pre-trained models for DAE using a large pseudo-labeled enhanced dataset without (Case A) and with exemplar CNN (Case B).

| Model | A: Acc (%) | A: F1 (%) | A: Prec (%) | A: Recall (%) | A: Sen (%) | A: Spec (%) | A: AUC | B: Acc (%) | B: F1 (%) | B: Prec (%) | B: Recall (%) | B: Sen (%) | B: Spec (%) | B: AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VGG16 | 93 | 93 | 95 | 91 | 83 | 100 | 0.91 | 69 | 69 | 77 | 75 | 100 | 50 | 0.75 |
| ResNet-50 | 91 | 91 | 93 | 90 | 83 | 97 | 0.90 | 90 | 89 | 93 | 87 | 74 | 100 | 0.87 |
| MobileNet | 68 | 69 | 75 | 74 | 96 | 53 | 0.74 | 89 | 88 | 88 | 90 | 96 | 84 | 0.90 |
| Inception-v2 | 85 | 85 | 86 | 88 | 100 | 76 | 0.88 | 95 | 95 | 95 | 94 | 91 | 97 | 0.94 |
| Xception-v3 | 90 | 90 | 89 | 91 | 96 | 87 | 0.91 | 93 | 93 | 95 | 91 | 83 | 100 | 0.91 |
| DenseNet101 | 59 | 58 | 74 | 67 | 100 | 34 | 0.67 | 61 | 60 | 74 | 68 | 100 | 37 | 0.68 |
Table 14. Performance comparison of the core architecture with the best performances of the proposed twin self-supervised models.

| Method | Model | Acc (%) | F1 (%) | Prec (%) | Recall (%) | Sen (%) | Spec (%) | AUC | Pre-Trained Model |
|---|---|---|---|---|---|---|---|---|---|
| Core architecture | Vanilla AE | 80 | 84 | 80 | 83 | 90 | 71 | 0.84 | - |
| | CAE | 85 | 88 | 85 | 86 | 71 | 100 | 0.88 | - |
| | DAE | 93 | 95 | 93 | 93 | 85 | 100 | 0.94 | - |
| Proposed method, small pseudo-labeled expanded dataset | Vanilla AE | 98 | 98 | 98 | 99 | 100 | 97 | 0.99 | VGG16 and DenseNet-101 |
| | CAE | 95 | 95 | 94 | 96 | 100 | 92 | 0.96 | ResNet-50 and Inception-v2 |
| | DAE | 97 | 96 | 97 | 96 | 91 | 100 | 0.96 | MobileNet |
| Proposed method, large pseudo-labeled expanded dataset | Vanilla AE | 98 | 98 | 99 | 98 | 96 | 100 | 0.98 | Inception-v2 |
| | CAE | 98 | 98 | 99 | 98 | 96 | 100 | 0.98 | Xception-v3 |
| | DAE | 95 | 95 | 95 | 94 | 91 | 97 | 0.94 | Inception-v2 |
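Table 15. Comparison of the proposed work with the state-of-the-art performance.

| Method/Reference | Database | No. of Images (Unlabeled) | Labeled (N) | Labeled (G) | Acc (%) | Prec (%) | Recall (%) | F1 (%) | Sen (%) | Spec (%) | AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Transfer Learning | | | | | | | | | | | |
| Transfer learning (2019) [22] | ESPERANZA | | 1333 | 113 | 88 | - | - | - | 87 | 89 | 0.94 |
| | RIM-ONE r1 | | 118 | 40 | | | | | | | |
| | RIM-ONE r2 | | 255 | 200 | | | | | | | |
| | RIM-ONE r3 | | 85 | 74 | | | | | | | |
| | Drishti-GS | | 31 | 70 | | | | | | | |
| Fine-tuning (2019) [23] | HRF | | 15 | 15 | 100 | 100 | 100 | - | - | - | - |
| Transfer learning (2021) [25] | RIM-ONE r2 | | 255 | 200 | 89 | 87 | 87 | 87 | - | - | 0.88 |
| Transfer learning (2022) [26] | ORIGA | | 482 | 168 | 91.6 | 98.6 | - | 89 | 81.4 | 99 | 0.90 |
| | ACRIMA | | 309 | 396 | | | | | | | |
| | HRF | | 15 | 15 | | | | | | | |
| | Drishti-GS | | 31 | 70 | | | | | | | |
| Data augmentation and Transfer learning (2023) [27] | G1020 | | 724 | 296 | 98 | - | - | 98 | 99 | 96 | 0.97 |
| | Drishti-GS | | 31 | 70 | | | | | | | |
| | ORIGA | | 482 | 168 | | | | | | | |
| | RIM-ONE r1 | | 118 | 41 | | | | | | | |
| Semi-supervised Learning | | | | | | | | | | | |
| Co-forest algorithm (2018) [28] | RIM-ONE r3 | | 85 | 74 | 90.8 | - | - | - | - | - | - |
| Encoder-decoder CNN training (2018) [29] | HRF | | 15 | 15 | - | - | - | - | - | - | 0.92 |
| | Drishti-GS | | 31 | 70 | | | | | | | |
| | RIM-ONE r3 | | 85 | 74 | | | | | | | |
| | DRIONS-DB | | 50 | 60 | | | | | | | |
| | Private | | 20 | 13 | | | | | | | |
| Feature extraction with ML classification (2019) [30] | Private | | 294 | 418 | 95 | 96 | - | 95 | - | - | - |
| Image synthesizing and feature extractor (2019) [31] | Drishti-GS | | 31 | 70 | - | - | - | 84.2 | 82.9 | 79.8 | 0.90 |
| | ORIGA | | 482 | 168 | | | | | | | |
| | RIM-ONE | | 261 | 194 | | | | | | | |
| | sjchoi86-HRF | | 300 | 101 | | | | | | | |
| | HRF | | 18 | 27 | | | | | | | |
| | ACRIMA | | 309 | 396 | | | | | | | |
| | Collection of nine databases | 84,569 | - | - | | | | | | | |
| Pseudo-Label Generation | | | | | | | | | | | |
| Unsupervised learning and Transfer learning (2021) [32] | RIM-ONE r2 | | 255 | 200 | 93.8 | - | - | - | 98.9 | 90.5 | - |
| | RIGA | 750 | - | - | | | | | | | |
| Deep metric learning (2023) [33] | ACRIMA | | 309 | 396 | 90.2 | - | - | 38.4 | - | - | 0.89 |
| | LAG | | 3143 | 1711 | | | | | | | |
| | OHTS | | 71,176 | 3502 | | | | | | | |
| | DIGS/ADAGES | | 5184 | 4289 | | | | | | | |
| Core Architecture using DAE (in the proposed work) | ORIGA | | 482 | 168 | 93 | 95 | 93 | 93 | 85 | 100 | 0.94 |
| | Drishti-GS | | 31 | 70 | | | | | | | |
| Proposed work | ORIGA | | 482 | 168 | 98 | 99 | 98 | 98 | 96 | 100 | 0.98 |
| | Drishti-GS | | 31 | 70 | | | | | | | |
| | HRF | | 15 | 15 | | | | | | | |
| | FIVES | | 200 | 200 | | | | | | | |
| | MSHF | | 26 | 52 | | | | | | | |
| | LES AV | | 11 | 11 | | | | | | | |
| | G1020 | | 724 | 296 | | | | | | | |
| | CRFO-v4 | | 31 | 48 | | | | | | | |
| | EyePACS-AIROGS | | 3270 | 3270 | | | | | | | |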
Table 16. Computational complexity and Training time of the proposed method—CAE Model.
Method/Model | Parameters (Millions) | FLOPs (Millions per Sample) | Training Time (Seconds)
Exemplar CNN, Xception-v3 (Stage 1 in Figure 2) | 189.07 | 10,066.57 | 1841.46
Supervised Classification for pseudo-labeling (Stage 2 in Figure 2) | 298.37 | 10,067.16 | 73.55
Unsupervised learning, CAE (Stage 3 in Figure 2) | 135.37 | 983.23 | 492.76
Proposed method, CAE-SVM (Stage 4 in Figure 2) | - | 0.24 | 71.44
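To give a sense of how parameter and FLOP figures such as those in Table 16 are typically obtained, the sketch below counts both for a stack of 2D convolution layers. It is an illustration under stated assumptions: the layer sizes are hypothetical and do not describe the paper's networks, the function names are invented for this example, and one multiply-accumulate is counted as two FLOPs, a convention that varies between profiling tools. The section does not state which tool or convention the authors used.

```python
def conv2d_cost(c_in, c_out, kernel, h_out, w_out, bias=True):
    """Return (parameters, FLOPs per sample) for one square-kernel 2D convolution.

    Assumption: one multiply-accumulate (MAC) is counted as 2 FLOPs.
    """
    weights = kernel * kernel * c_in * c_out
    params = weights + (c_out if bias else 0)
    macs = weights * h_out * w_out  # each output position uses every weight once
    return params, 2 * macs


def total_cost(layers):
    """Sum parameters and FLOPs over a list of layer descriptions."""
    total_params = 0
    total_flops = 0
    for layer in layers:
        params, flops = conv2d_cost(**layer)
        total_params += params
        total_flops += flops
    return total_params, total_flops


if __name__ == "__main__":
    # Hypothetical two-layer encoder, for illustration only.
    encoder = [
        {"c_in": 3, "c_out": 16, "kernel": 3, "h_out": 112, "w_out": 112},
        {"c_in": 16, "c_out": 32, "kernel": 3, "h_out": 56, "w_out": 56},
    ]
    params, flops = total_cost(encoder)
    print(f"parameters: {params / 1e6:.4f} M")
    print(f"FLOPs per sample: {flops / 1e6:.2f} M")
```

Counts of this kind cover only the convolution layers; dense layers, normalization, and activations add further terms, so totals reported by a framework profiler will generally differ.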
