Article

Self-Supervised Representation Learning for Data-Efficient DRIL Classification in OCT Images

by Pavithra Kodiyalbail Chakrapani 1, Akshat Tulsani 2, Preetham Kumar 1,*, Geetha Maiya 1, Sulatha Venkataraya Bhandary 3 and Steven Fernandes 4,*
1 Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal 576104, India
2 Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA
3 Department of Ophthalmology, Kasturba Medical College Manipal, Manipal Academy of Higher Education, Manipal 576104, India
4 Department of Computer Science, Design and Journalism, Creighton University, Omaha, NE 68187, USA
* Authors to whom correspondence should be addressed.
Diagnostics 2025, 15(24), 3221; https://doi.org/10.3390/diagnostics15243221
Submission received: 3 November 2025 / Revised: 8 December 2025 / Accepted: 15 December 2025 / Published: 16 December 2025
(This article belongs to the Special Issue Artificial Intelligence in Eye Disease, 4th Edition)

Abstract

Background/Objectives: Disorganization of the retinal inner layers (DRIL) is an important biomarker of diabetic macular edema (DME) that is strongly associated with visual acuity (VA). However, the scarcity of expert-annotated training data, together with the large domain gap that limits the transferability of models pretrained on natural images, poses two primary challenges for the design of efficient computerized DRIL detection methods. Methods: To address these challenges, we propose a self-supervised learning framework that uses a large unlabeled optical coherence tomography (OCT) dataset to learn clinically relevant representations before fine-tuning on a small proprietary dataset of annotated OCT images. We introduce a spatial Bootstrap Your Own Latent (BYOL) model with a hybrid spatially aware loss function designed to capture anatomical representations from an unlabeled OCT dataset of 108,309 images covering various retinal abnormalities, and then adapt the learned representations for DRIL classification using 823 annotated OCT images. Results: With an accuracy of 99.39%, the proposed two-stage approach substantially exceeds direct transfer learning from models pretrained on ImageNet. Conclusions: The findings demonstrate the efficacy of domain-specific self-supervised learning for rare retinal pathology detection tasks with limited annotated data.

Graphical Abstract

1. Introduction

Diabetic retinopathy is a major complication of diabetes mellitus, a chronic condition that affects a large population owing to its widespread incidence. Diabetic macular edema (DME) remains the major cause of vision loss in diabetic individuals. According to the International Diabetes Federation (IDF) Diabetes Atlas (11th edition), 1 in 9 individuals aged 20–79 has diabetes, and 40% of those affected are unaware that they have the illness. The IDF forecasts that 853 million people, or 1 in 8 adults, will have the disease by 2050, a 46% rise in incidence [1]. The diagnosis, monitoring, and treatment of DR and DME have therefore become vital in recent years. Imaging modalities like OCT provide high-resolution imaging of the retinal tissue, allowing clear, non-invasive visualization of changes in retinal morphology [2]. These imaging methods have become essential for guiding therapy and predicting patient outcomes, along with improving diagnostic efficiency.
Retinal health and structural abnormalities can be assessed with the help of noninvasive OCT imaging biomarkers. Subretinal fluid (SRF), cystic spaces within the intraretinal region, and DRIL, along with retinal thickness, are all crucial biomarkers on OCT that serve as important indicators of disease severity, treatment response, and progression. A vital imaging biomarker for DR, and primarily DME, is DRIL as it appears in OCT images. According to Sun et al., "Disorganization of the retinal inner layers was defined as the horizontal extent in microns for which any boundaries between the ganglion cell–inner plexiform layer complex, inner nuclear layer, and the outer plexiform layer could not be identified" [3]. Figure 1 illustrates the absence and presence of DRIL in the OCT image.
The importance of DRIL as a reliable biomarker across multiple dimensions has been demonstrated by extensive clinical research. Novel possibilities have been offered by Deep Learning (DL) in the domain of healthcare imaging for identification of biomarkers and computer-based disease categorization. DL architectures have provided exceptional outcomes in the case of retinal image classification tasks, typically comparable to or even surpassing the results provided by doctors in the identification of a variety of OCT characteristics, including hyperreflective retinal foci (HRF), SRF, and structural differences in the ellipsoid zone (EZ).
Even though current treatment methods for DME and DR are increasingly successful, early disease detection and timely care remain beneficial. Automated disease-identification tools that can triage patient groups for referral to a suitable ophthalmologist are therefore very important. Individuals with DME are routinely scanned with OCT in clinical practice. The OCT biomarker DRIL has been shown to be associated with RNFL thickening, decreased VA, EZ disruption, and retinal function impairment. DRIL is very difficult to detect, as the OCT variations are subtle, especially during the earlier manifestations of DME and DR [5]. Recent developments in deep learning and artificial intelligence have shown enhanced potential for faster, automated, and accurate health evaluation in the ophthalmic domain.
Unique and promising OCT biomarkers like DRIL are crucial, as the literature offers no alternative methods for anticipating when people with DME may lose or regain eyesight over time. Potential biomarkers such as DRIL, the location of fluid accumulation, and variations in retinal morphology can all be identified with the help of OCT. In research related to retinal health, identifying DRIL at baseline is important for stratifying patients by disease prognosis. Grading of DRIL by human experts is tedious, whereas automated methods can speed up the work and enable more extensive research into the connection between DRIL development and treatment response. A few researchers have attempted to use OCT with DL to diagnose ME associated with other disorders by leveraging the power of AI models. The method used in the implementation, the uniqueness of DRIL, and the idea of integrating DRIL detection with computer-based diagnostic methods are detailed in this research.

2. Related Work

This section details the state-of-the-art methods for DRIL detection and the studies that establish the association of DRIL with other OCT biomarkers. Sun et al. [3] defined DRIL as a novel biomarker and showed that greater DRIL extent at baseline is correlated with decreased VA. Importantly, changes in DRIL over the initial four months were a stronger predictor of VA outcomes at eight months than central subfield thickness (CST). This study gave useful insights into the characteristics of DRIL and its associations, establishing DRIL as a noninvasive, reliable predictor of vision. Radwan et al. [6] revealed that in the presence of DME, changes in the resolution of DRIL are strongly correlated with future VA: with severe DRIL, VA outcomes were very poor, but as DRIL cleared, VA improved. This research showed that, in the context of diabetic abnormalities, DRIL is a uniquely predictive biomarker. An extended follow-up study by Sun et al. [7] assessed 80 eyes with resolved and ongoing DME and showed that, in both cases, DRIL spanning more than 50% of the foveal 1 mm region was consistently linked to lower VA. Grewal and Jaffe [4] defined DRIL as the "inability to distinguish between the outer plexiform layer, inner nuclear layer, and ganglion cell layer-inner plexiform layer complex." They found DRIL to be a reliable predictive biomarker for evaluating VA in both uveitic and diabetic ME and described it as a crucial tool for patient counselling and therapy. Numerous studies have documented the use of DRIL as a DME biomarker. Acon and Wu [8] report that more remains to be discovered about reliable biomarkers like DRIL, and that machine learning can be used to plan treatment options and diagnoses built around various imaging modalities, even though OCT offers the best means of managing DME.
Das et al. [9] analyzed the association of DRIL with the integrity of the outer retinal region. Using SD-OCT, they showed that a larger horizontal extent of DRIL is significantly associated with disruption of the EZ and the external limiting membrane (ELM), and that both factors contributed to lower best corrected VA (BCVA). They demonstrated that DRIL, besides being a surrogate marker of VA, is also a predictor of increased morphological degeneration in the outer retina. These findings support the view that DRIL acts as a pathophysiological link between inner and outer retinal disruption. Joltikov et al. [10] demonstrated a correlation of DRIL with decreased VA in individuals with early-stage diabetic retinopathy (DR) who do not have ME. They also noted the therapeutic significance of DRIL and reported the possibility that DRIL is an early cellular consequence of diabetes. Nakano et al. [11] highlighted the role of DRIL as an important tool for assessing detailed visual function. They also showed that, irrespective of ME and VA status, DRIL strongly correlates with the level of metamorphopsia in individuals with DME. These findings were supported by Nadri et al. [12], who demonstrated the association of DRIL with thinning of the retinal nerve fiber layer and EZ disruption. Di-Luciano et al. [13] performed a systematic assessment of seven research publications evaluating DRIL as an important biomarker of DME.
Most of the articles in that review employed SD-OCT scans with retrospective cohort and cross-sectional designs. Across these reports, DRIL resolution was linked to improvement of vision, and the presence and increasing extent of DRIL were consistently linked to decreased VA. Recent technical improvements in DL have opened up new possibilities for automated DRIL detection. Singh et al. [5] used OCT images to categorize DRIL. This research also outlined the advantage of using DL in therapeutic healthcare decisions by designing the first DL model for OCT biomarker categorization. Singh et al. [14] developed a DL-based convolutional neural network (CNN) achieving an accuracy of 88.3% in DRIL classification. Although it was one of the initial attempts to classify DRIL, this research lacked explainability. A fuzzy-logic-based system was employed by Tripathi et al. [15], leveraging DRIL, HRF, and cystoid spaces to determine DME severity from OCT scans. The method aimed at acquiring and reporting quantitative information to define classes of DME severity. Table 1 provides the details of the datasets used and the results obtained for DRIL classification by previous studies.
To bridge the gap between computer-based decision support systems and clinical decision-making strategies, this work showed that easily interpretable rule-based methods incorporating multiple biomarkers such as DRIL are achievable for computerized assessment of DME severity. To identify hard exudates (HE) and classify DRIL in patients with DME, Toto et al. [17] offered a DL-based system. They report that DL-supported DRIL detection is viable and provided one of the early extensive AI-based methods combining classification pipelines with object detection-based strategies for DME biomarker assessment. Irrespective of the treatments given to patients, multiple studies have related the extent of DRIL to both baseline VA and long-term outcomes, as found by the extensive literature search assessing DRIL-related articles by Tripathi et al. [19]. They reported the need for additional research to improve DRIL identification and to use the results for individualized patient care. Singuri et al. [16] demonstrated that, despite the subjective nature of DRIL grading, it has a substantial association with VA status and DR. Ruiz-Medrano et al. [18] identified DRIL as a vital OCT biomarker of DME in a study aimed at predicting treatment options. They showed that the presence of DRIL, along with other biomarkers, predicts failure of anti-VEGF treatment, necessitating additional treatment approaches.

3. Materials and Methods

This section details the method, the datasets, and the training pipeline (Figure 2) used for DRIL classification under limited-annotation conditions. We employ a self-supervised Bootstrap Your Own Latent (BYOL) [20] learning framework in which a ResNet-50 backbone pretrained on ImageNet serves as the encoder. This encoder is refined on 108,309 unlabeled OCT images to learn domain-specific retinal features in a class-agnostic manner. DRIL classification is then achieved by adapting this pretrained ResNet-50 encoder via transfer learning using only 823 labeled images.

3.1. Mendeley Dataset

A large labelled OCT and chest X-ray collection made available by Kermany et al. [21] is utilized for this research. The dataset contains 109,309 OCT images belonging to four categories: DME, Drusen, Normal, and Choroidal Neovascularization (CNV). All these retinal OCT images were obtained from retinal scans performed at UC San Diego Health and the Shiley Eye Institute. The dataset is subdivided into training, validation, and testing groups to facilitate self-supervised learning. Because it is publicly available, diverse, and comprehensive, this dataset has become a common benchmark for evaluating DL models in retinal image categorization.

3.2. KMC Dataset

The private, labeled dataset was obtained from the Department of Ophthalmology, Kasturba Medical College (KMC), Manipal, MAHE, Manipal. This dataset contains fovea-centered, anonymized, original OCT B-scans that were used for evaluating the DRIL classification model. Ethical clearance for the dataset was obtained from the Kasturba Medical College and Kasturba Hospital Institutional Ethics Committee (approval code IEC1-287/2022). The retrospective dataset contained horizontal, high-quality, fovea-centered B-scans with a signal strength greater than 7, collected between January 2019 and August 2022. All OCT images were captured using a certified Zeiss Cirrus HD-OCT 5000 imaging device. The dataset contains 429 OCT images with DRIL present and 394 images without DRIL. The research uses expert-validated consensus annotations created by two experienced ophthalmologists from KMC Manipal, each with more than twenty-three years of professional experience treating patients with a multitude of eye disorders, including DR and DME. The images were labelled independently, and final reconciliations were performed to resolve discrepancies using a consensus protocol. The resulting dataset is therefore highly reliable, with clinical accuracy and relevance, providing a strong basis for the training and validation of DL models while preventing bias and promoting model generalization.

Inter-Observer Agreement Statistics

Inter-observer variability metrics help us understand the consistency of, and differences in, the annotations among the observers. Two experienced doctors, both with more than 23 years of clinical practice, labelled the 823 OCT images for the presence or absence of DRIL. Cohen's kappa coefficient ( κ ), which measures the agreement between the two sets of annotations beyond chance, is used to quantify inter-observer variability. Among the 823 OCT scans, 796 (96.7%) were annotated with full agreement between the doctors: both agreed on the presence of DRIL in 409 (49.7%) scans, and 387 (47.0%) were marked as DRIL-absent. For the remaining 27 scans the classifications were discordant: 16 images (1.9%) were labelled DRIL-present by Observer 1 and DRIL-absent by Observer 2, while 11 images (1.3%) were labelled DRIL-absent by Observer 1 and DRIL-present by Observer 2. Statistical assessment revealed excellent inter-observer agreement, with Cohen's κ = 0.933 (95% CI: 0.906–0.960, p < 0.001), well above the excellent-reliability threshold ( κ > 0.81). All 27 discordant cases were discussed by both doctors to resolve the discrepancies and produce the final annotations through consensus labelling. The detailed inter-observer agreement analysis is shown in Figure 3. The confusion matrix (Panel A) shows strong concordance with 96.7% overall agreement (796/823 images) and very few discordant cases (27, or 3.3%): in 16 cases (59.3% of disagreements) Observer 1 alone marked DRIL-present, and in 11 cases (40.7%) Observer 2 alone did. As Panel C's kappa scale shows, our result sits firmly in the "Excellent" range.
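The reported agreement statistics can be approximately reproduced from the counts above; the following is a minimal sketch (the function name is illustrative), with the small difference from the published κ = 0.933 attributable to rounding in the paper:

```python
# Cohen's kappa for the two observers' DRIL annotations, computed from the
# agreement counts reported above.
def cohens_kappa(both_present, both_absent, obs1_only, obs2_only):
    n = both_present + both_absent + obs1_only + obs2_only
    p_observed = (both_present + both_absent) / n
    # Marginal totals: how often each observer called DRIL present.
    obs1_present = both_present + obs1_only
    obs2_present = both_present + obs2_only
    # Chance agreement from the marginals.
    p_chance = (obs1_present * obs2_present
                + (n - obs1_present) * (n - obs2_present)) / n**2
    return (p_observed - p_chance) / (1 - p_chance)

kappa = cohens_kappa(both_present=409, both_absent=387,
                     obs1_only=16, obs2_only=11)
print(f"kappa = {kappa:.3f}")  # close to the reported 0.933
```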

3.3. Representation Learning in Anatomical Context

Standard BYOL applies global average pooling to the encoder’s final feature map to obtain a single global representation of the image. This pooled vector is then passed through a projection MLP and a prediction MLP. While this approach works well for natural image representation learning, it discards all spatial structure. In medical imaging, specifically for OCT, spatial information is essential, as the pathologies are localized and retinal layers follow a fixed geometric structure. To address this, we extend BYOL with spatially aware feature learning and introduce a spatial self-supervision loss aimed at preserving structural information represented with Equation (1). Instead of relying solely on the deepest feature map, we extract latent representations at multiple spatial resolutions from the ResNet-50 backbone.
{s1, s2} = f_θ(x),  s_i ∈ ℝ^(B × C_i × H_i × W_i),  (1)
where s1 ∈ ℝ^(B × 1024 × 14 × 14) and s2 ∈ ℝ^(B × 2048 × 7 × 7).
Here, s1 and s2 correspond to feature maps at increasing network depth and decreasing spatial resolution. The final feature map s2 is passed through adaptive average pooling, as in the standard BYOL implementation; this branch keeps the standard BYOL objective and maintains compatibility with the global downstream task. For spatial learning, we select s1 as the input to our Spatial Branch. This layer provides an ideal balance between spatial granularity, with 196 spatial locations, and rich semantic content, with 1024 channels. We process s1 through our custom Spatial Projection Head and Spatial Prediction Head, both of which preserve the spatial grid. The Spatial Projection Head applies a 3 × 3 convolution followed by batch normalization and a non-linearity, then a 1 × 1 convolution with batch normalization, producing a lower-dimensional spatial representation without destroying location information. The Spatial Prediction Head takes this spatial representation as input and applies two CNN blocks, yielding a representation that plays the role of the BYOL predictor for the spatial branch. This setup avoids collapsing the feature map into a single vector and instead learns a dense field of spatial predictions, allowing the model to retain anatomical structure.
To capture both global semantics and anatomical structure, we introduce a hybrid loss that integrates the standard BYOL global objective with our spatial self-supervision objective. The global branch operates on the deepest feature map s2. Let p1^g and p2^g denote the global predictions for the two augmented views, and z1^g and z2^g the corresponding target projections obtained from s2; the global BYOL loss is given by Equation (2).
L_global = (1/2) [cos(p1^g, z2^g) + cos(p2^g, z1^g)],  where cos(a, b) = 2 − 2 (a · b) / (‖a‖ ‖b‖).  (2)
In parallel, the spatial branch operates on s1, preserving its H × W resolution. After passing s1 through the Spatial Projection and Prediction Heads, we obtain spatial prediction maps P1, P2 ∈ ℝ^(B × d_s × H × W) for the two augmented views. The target network produces corresponding spatial projections Z1, Z2 with the same dimensions. We compute the cosine similarity of the 128-dimensional vectors at each of the 14 × 14 locations and take the mean across all locations, formally given by (3),
sp(P, Z) = 2 − 2 · E_{b,h,w} [ ⟨P̂_{b,:,h,w}, Ẑ_{b,:,h,w}⟩ ],  (3)
where P̂ and Ẑ denote the ℓ2-normalized vectors at each spatial location,
and apply the same symmetric formulation:
L_spatial = (1/2) [sp(P1, Z2) + sp(P2, Z1)].
Finally, we combine the global and spatial objectives into the hybrid loss:
L_hybrid = 0.5 · L_global + 0.5 · L_spatial.
This hybrid loss enables the model to learn global semantics through s2 and local anatomical representations through the spatial supervision applied to s1. The resulting encoder captures both high-level disease context and fine-grained retinal morphology, properties that are critical for downstream OCT lesion localization and segmentation.
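As a numerical illustration of the hybrid loss above (not the training code), the symmetric global and spatial terms can be sketched in NumPy; the batch size, feature dimensions, and random inputs are placeholders:

```python
import numpy as np

def cos_loss(a, b):
    """Per-sample BYOL cosine loss 2 - 2*cos(a, b), averaged over the batch."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.mean(2 - 2 * np.sum(a * b, axis=1))

def spatial_loss(P, Z):
    """Mean cosine loss over every (b, h, w) location of B x d x H x W maps."""
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return np.mean(2 - 2 * np.sum(P * Z, axis=1))

rng = np.random.default_rng(0)
p1g, z2g = rng.normal(size=(4, 256)), rng.normal(size=(4, 256))
p2g, z1g = rng.normal(size=(4, 256)), rng.normal(size=(4, 256))
P1, Z2 = rng.normal(size=(4, 128, 14, 14)), rng.normal(size=(4, 128, 14, 14))
P2, Z1 = rng.normal(size=(4, 128, 14, 14)), rng.normal(size=(4, 128, 14, 14))

L_global = 0.5 * (cos_loss(p1g, z2g) + cos_loss(p2g, z1g))      # Eq. (2)
L_spatial = 0.5 * (spatial_loss(P1, Z2) + spatial_loss(P2, Z1))  # Eq. (3)
L_hybrid = 0.5 * L_global + 0.5 * L_spatial                      # equal weighting
```

Note that each term is zero when the two views' representations coincide, and at most 4 when they are anti-aligned.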
We trained the spatial BYOL encoder for up to 125 epochs with early stopping (patience of 25 epochs), a batch size of 64, and gradient accumulation over 4 consecutive mini-batches, yielding an effective batch size of 256 for each optimizer update. We used the AdamW optimizer with an initial learning rate of 3 × 10^-4 and weight decay of 1 × 10^-4, together with a cosine annealing learning-rate schedule over 100 epochs. We employed mixed-precision training and gradient clipping with a max norm of 1.0; all runs were carried out on NVIDIA A100 GPUs. The online and target encoders were coupled via an exponential moving average update with decay τ = 0.996. The total training loss combined the global BYOL loss and the spatial BYOL loss with equal weighting (λ_spatial = 0.5), and the best checkpoint, observed at the 100th epoch, was selected as the backbone based on the minimum total BYOL loss on the training set, all of which are depicted in Figure 4, Figure 5 and Figure 6. Since the framework is fully self-supervised and does not use class labels, all four disease categories (CNV, DME, DRUSEN, and NORMAL) contribute equally to representation learning. This class-independent pretraining encourages the learned features to capture a broad range of OCT characteristics across disease types, improving the generalizability of the downstream models.
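The exponential-moving-average coupling of the online and target encoders can be illustrated as follows; this is a toy sketch over raw parameter arrays, not the actual training loop:

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.996):
    """BYOL target update: target <- tau * target + (1 - tau) * online."""
    return [tau * t + (1 - tau) * o
            for t, o in zip(target_params, online_params)]

online = [np.ones(3)]    # stand-in for the online encoder's parameters
target = [np.zeros(3)]   # stand-in for the target encoder's parameters
target = ema_update(target, online)  # each entry: 0.996*0 + 0.004*1 = 0.004
```

With τ = 0.996 the target network changes slowly, which is what stabilizes BYOL training without negative pairs.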

3.4. Fine-Tuning for DRIL Identification

We employed the trained encoder for binary DRIL classification after the self-supervised pretraining. The pretrained BYOL encoder, together with a lightweight classification head, forms the classification model. The head consists of two fully connected layers with dropout regularization and batch normalization. The classification head is responsible for the application-specific decision boundary, while the encoder extracts the high-level features needed for DRIL detection. Kaiming initialization was used to randomly initialize the linear layers of the classification head.
To avoid catastrophic forgetting while enabling application-specific adaptation, we employed a two-phase fine-tuning approach analogous to pretraining. In the first phase, the classification head was trainable and the pretrained encoder was frozen. This lets the randomly initialized classifier learn task-relevant patterns from the pretrained features without disturbing the learned representations. During this phase only a small fraction (2.6 M) of the 73 M parameters was trainable. In the second phase, end-to-end fine-tuning is performed by unfreezing the encoder. The learning rate was lowered by 90% to allow the pretrained features to adapt to DRIL-specific characteristics while promoting training stability. This phase allows fine-grained adaptation of both high- and low-level features for optimal DRIL detection.
We used a weighted cross-entropy loss with inverse-frequency class weights and the AdamW optimizer with a batch size of 32. When validation accuracy plateaued, the ReduceLROnPlateau scheduler dynamically reduced the learning rate, enabling gentler optimization in the later training stages. To provide stability during training, gradient clipping was enabled with a norm of 1.0. The efficiency of the self-supervised method was demonstrated by training converging within 20–25 epochs on average (Figure 7 and Figure 8), owing to the high-quality pretrained representations.
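Inverse-frequency weighting can be sketched as follows; the counts shown are the full-dataset class counts from Section 3.2 (in practice the weights would be computed on the training split), and the function name is illustrative:

```python
def inverse_frequency_weights(counts):
    """Weight class c by N / (K * n_c), so rarer classes get larger weights."""
    total = sum(counts)
    k = len(counts)
    return [total / (k * c) for c in counts]

# 429 DRIL-present vs. 394 DRIL-absent images.
w_dril, w_absent = inverse_frequency_weights([429, 394])
# The rarer DRIL-absent class receives the larger weight (w_absent > w_dril);
# these weights are what a weighted cross-entropy loss would consume.
```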
We also evaluated the BYOL classifier against other state-of-the-art CNN models to identify the best possible model for classifying DRIL in OCT images. Six CNN models pretrained on ImageNet (EfficientNetB3, ResNet50, InceptionResNetV2, MobileNetV2, DenseNet169, and VGG16) were fine-tuned to classify no-DRIL and DRIL OCT images. An almost identical training protocol was used for all six CNN models. The input image size was adjusted to each model's requirements to optimize outcomes; the input sizes and augmentations used are listed in Table 2. A stratified dataset split (80% for training, 10% of the training data used for validation, and 20% for testing) was used. Each model was first evaluated for baseline performance with the backbone frozen, trained for 15 epochs, followed by complete fine-tuning of all layers for another 15 epochs. All models were trained with the Adam optimizer, with a learning rate of 1 × 10^-3 during the frozen stage and 1 × 10^-4 during the fine-tuning phase. A batch size of 32 with binary cross-entropy loss was used across all architectures. The best model checkpoint, obtained at minimum validation loss, was saved for further evaluation. An independently held-out test set of 165 samples (20% of the entire dataset of 823 images) was used to assess all fine-tuned models.

4. Results

Our experiments show that standard transfer learning baselines achieve strong performance on the DRIL classification task, with models such as ResNet50 and VGG16 reaching 98.18% accuracy. Table 3 provides a detailed comparison between our method and several pretrained state-of-the-art architectures. However, our proposed approach, Spatial BYOL with Hybrid loss, is able to outperform all baselines, achieving 99.39% accuracy with only a single misclassification on the test set.
Table 4 further contextualizes this improvement by comparing model sizes and convergence behavior. With 73 million parameters, of which only 27 million are updated during fine-tuning, our model performs better than the VGG-16 with over 138 million parameters and converges faster. Together, these results demonstrate that the proposed Spatial BYOL learns more expressive representations than existing pretrained models, enabling both faster convergence and improved final performance with substantially fewer trainable parameters.
The confusion matrix in Figure 9 reveals the model's error pattern. Out of 78 No-DRIL cases, 77 were correctly classified, giving a specificity of 98.72%. Our approach identified all 86 DRIL cases correctly, yielding 100% sensitivity with zero false negatives. This performance is particularly significant for clinical deployment, as failing to detect DRIL (false negatives) has greater clinical consequences than false alarms, which can be resolved through secondary review. Explainability is crucial for an AI model to be clinically accepted for DRIL detection. Gradient-weighted Class Activation Mapping (Grad-CAM) heatmaps indicate which regions of the OCT images contributed most to the DRIL prediction. Figure 10 and Figure 11 provide the Grad-CAM heatmaps obtained for the various pretrained models along with the proposed spatial BYOL implementation.
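The headline metrics follow directly from the confusion-matrix counts above; as a quick sanity check:

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity (recall on DRIL), and specificity from counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Counts from Figure 9: 77/78 No-DRIL correct (1 false positive), 86/86 DRIL correct.
m = binary_metrics(tn=77, fp=1, fn=0, tp=86)
# accuracy ~ 0.9939, sensitivity = 1.0, specificity ~ 0.9872
```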

5. Discussion

DRIL remains a difficult target for reliable detection in a clinical setting. In contrast to more overt retinal pathology such as large hemorrhages or exudates, DRIL appears as a very subtle irregularity within the inner retinal layers and often requires careful, targeted inspection for even experienced graders to identify confidently. These biomarkers and their associated structural changes alter the scan appearance only slightly but still have clinical relevance, which makes algorithmic detection challenging. The problem is further amplified by the limited availability of high-quality expert-annotated DRIL cases.
Supervised deep neural networks typically require large numbers of labeled examples to learn fine-grained and discriminative visual cues of this type. However, obtaining reliable labels for DRIL is costly, labor intensive, and slow, and this limits the achievable dataset scale in practice. Conventional transfer learning from natural image benchmarks also provides only marginal benefit because of the substantial domain mismatch. Feature extractors trained on everyday photographs tend to emphasize cues that are characteristic of object-centric scenes, such as sharp object boundaries, natural color statistics, and common surface textures. These learned inductive biases do not align well with retinal OCT, which is defined by layered biological structure, speckle characteristics, and subtle disease-related architectural changes.
To address the challenges of subtle pathological features, limited annotations, and domain shift, we propose leveraging self-supervised learning on a large corpus of unlabeled OCT data. Our approach lets the model learn domain-specific representations before being fine-tuned for pathology-specific tasks. This class-agnostic pretraining allows the model to grasp basic OCT image characteristics, such as retinal layer structures, tissue reflectivity patterns, speckle noise features, and anatomical spatial relationships, without needing pathology labels. By training on diverse pathological conditions in an unsupervised manner, the model learns to encode retinal layer boundaries, variations in tissue texture, and structural organization that are vital for generalizability. This contrasts with supervised pretraining on a single pathology, which would skew the learned features toward task-specific patterns and limit their transferability.
For OCT B-scans, and for medical imaging more broadly, contrastive methods appear less well-suited. We hypothesize that this is because individual scans share a highly similar global retinal structure, while disease-related changes often manifest as subtle, localized variations. This makes it difficult to construct truly “negative” examples: images from different patients, or even different disease categories, can still be very similar at the global level, increasing the risk of false negatives and degrading representation quality. At the same time, our goal is to operate in relatively low-data and moderate-compute regimes, where transformer-based SSL approaches and masked-image-modeling variants, including DINO with ViT backbones, are typically data- and computation-intensive and therefore harder to deploy at full scale. We also observe in our study that DINO with CNN-based backbones does not perform equally well. In this context, self-distillation methods such as BYOL offer a promising solution. BYOL does not rely on negative sampling, is compatible with convolutional backbones that naturally capture local retinal structure, and has been shown to perform well with smaller batch sizes and limited data. These properties make BYOL a natural choice for learning OCT-specific representations that remain sensitive to subtle pathology while respecting our computational constraints.
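To make the self-distillation objective concrete, the core of standard BYOL [20] reduces to two ingredients: the online network’s prediction is regressed onto the (stop-gradient) target projection with an L2-normalized mean squared error, and the target network’s weights are an exponential moving average of the online weights. The sketch below is a minimal NumPy illustration of these two ingredients only, not the full spatial BYOL with its hybrid spatial-aware loss described in the Methods; the array shapes and the τ value are illustrative assumptions.

```python
import numpy as np

def byol_loss(pred, target):
    """BYOL regression loss: squared error between the L2-normalized
    online prediction and target projection, which equals
    2 - 2 * cosine_similarity, averaged over the batch."""
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(np.sum((p - t) ** 2, axis=-1)))

def ema_update(target_w, online_w, tau=0.996):
    """Exponential-moving-average update of the target network;
    the target network is never updated by gradients."""
    return {k: tau * target_w[k] + (1 - tau) * online_w[k] for k in target_w}

# Representations that agree up to scale incur zero loss;
# opposite representations incur the maximum loss of 4.
v = np.random.default_rng(0).normal(size=(4, 8))
assert abs(byol_loss(v, 2.0 * v)) < 1e-9
assert abs(byol_loss(v, -v) - 4.0) < 1e-9
```

Because the loss depends only on direction, not magnitude, the encoder is pushed to align representations of two augmented views of the same scan without any negative pairs, which is what makes the method usable at the smaller batch sizes our compute budget allows.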
Our results establish self-supervised learning on domain-specific data as a viable strategy for creating foundation models in medical imaging. The pretrained BYOL encoder acts as a general feature extractor for OCT images, capable of rapid adaptation to downstream tasks with minimal labelled data. This foundation-model approach holds particular promise for rare pathologies and new biomarkers, where expert annotations are often limited. A single self-supervised pretraining phase on varied unlabeled OCT data can support multiple downstream applications, such as DRIL detection, drusen quantification, CNV classification, and layer segmentation, through lightweight, task-specific heads. This amortizes the computational cost of representation learning across multiple clinical applications.
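As an illustration of this amortization, adapting the frozen encoder to a new task reduces to training a small head on its output features. The sketch below is hypothetical: 32-dimensional synthetic features stand in for real encoder outputs, and the learning rate, step count, and toy labels are arbitrary choices made only to show the shape of the workflow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the frozen, pretrained encoder: in practice
# feats = encoder(oct_batch); here we use 32-D synthetic features with
# toy DRIL / no-DRIL labels that are linearly recoverable from them.
feats = rng.normal(size=(200, 32))
labels = (feats[:, 0] > 0).astype(float)

# Lightweight task-specific head: one logistic layer trained on top of
# the fixed representation (the encoder itself receives no gradients).
w, b = np.zeros(32), 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid
    grad = p - labels                           # dBCE/dlogits
    w -= lr * feats.T @ grad / len(labels)
    b -= lr * grad.mean()

train_acc = float(np.mean(((feats @ w + b) > 0) == (labels > 0.5)))
```

Because only the head’s parameters are optimized, a new downstream task (e.g., drusen quantification instead of DRIL detection) costs only this small training loop, not a repeat of the pretraining phase.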

6. Conclusions

We demonstrate a self-supervised learning framework that addresses the critical issue of computerized DRIL detection with limited labeled data. Our research shows that domain-specific self-supervised pretraining improves over traditional transfer learning strategies; the improved results and faster convergence under limited supervision indicate that the approach can adapt to complex and rare pathologies where labelled data are scarce. We acknowledge that the limited dataset size may constrain the generalizability of our findings. Our future research will explore adapting the proposed self-supervised OCT pretraining to other retinal imaging biomarkers and lesions. Furthermore, self-supervised learning with other imaging modalities, such as fundus photography and OCT angiography, incorporating the identification of different retinal disorders, may be explored as a step towards foundational models in ophthalmic diagnosis.

Author Contributions

Conceptualization, P.K.C., A.T., P.K., S.V.B. and S.F.; methodology, P.K.C., A.T., P.K. and G.M.; software, P.K.C., A.T. and P.K.; validation, P.K.C., A.T., P.K., G.M. and S.V.B.; formal analysis, P.K., G.M., S.V.B. and S.F.; investigation, P.K.C., A.T., P.K. and G.M.; resources, P.K.C. and S.V.B.; data curation, P.K.C. and S.V.B.; writing—original draft, P.K.C. and A.T.; writing—review & editing, P.K.C., A.T., P.K., G.M., S.V.B. and S.F.; visualization, P.K.C., A.T. and P.K.; supervision, P.K., G.M., S.V.B. and S.F.; project administration, P.K.C., A.T., P.K., G.M., S.V.B. and S.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Ethics Committee of Kasturba Medical College and Kasturba Hospital (Approval number: IEC1-287/2022 and date of approval: 3 February 2023).

Informed Consent Statement

Informed consent was waived because this is a retrospective study.

Data Availability Statement

The public data presented in the study are openly available in the Mendeley database at [https://doi.org/10.17632/rscbjbr9sj.3]. The private dataset generated and analyzed during this study is not publicly available due to ethical considerations; however, it can be obtained from the corresponding author upon reasonable request. The implementation code is available at [https://github.com/Tulsani/Spatial-Byol] (accessed on 2 November 2025).

Acknowledgments

The authors acknowledge the usage of retinal OCT images from the Department of Ophthalmology, Kasturba Medical College Manipal, Manipal Academy of Higher Education, Manipal, Karnataka.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. International Diabetes Federation. Diabetes Facts & Figures. Available online: https://idf.org/about-diabetes/diabetes-facts-figures/ (accessed on 7 August 2025).
  2. Ruia, S.; Saxena, S.; Cheung, C.; Gilhotra, J.; Lai, T. Spectral domain optical coherence tomography features and classification systems for diabetic macular edema: A review. Asia-Pac. J. Ophthalmol. 2016, 5, 360–367. [Google Scholar] [CrossRef] [PubMed]
  3. Sun, J.; Lin, M.M.; Lammer, J.; Prager, S.; Sarangi, R.; Silva, P.S.; Aiello, L.P. Disorganization of the retinal inner layers as a predictor of visual acuity in eyes with Center-Involved diabetic macular edema. JAMA Ophthalmol. 2014, 132, 1309. [Google Scholar] [CrossRef] [PubMed]
  4. Grewal, D.; Jaffe, G.; Hariprasad, S. Role of disorganization of retinal inner layers as an optical coherence tomography biomarker in diabetic and uveitic macular edema. Ophthalmic Surg. Lasers Imaging Retin. 2017, 48, 282–288. [Google Scholar] [CrossRef] [PubMed]
  5. Singh, R.; Luo, S.; Hatipoglu, D.; Yuan, A.; Anand-Apte, B. Deep learning based tool for routine and rapid DRIL identification for patient with diabetes. Investig. Ophthalmol. Vis. Sci. 2020, 61, PB00144. [Google Scholar]
  6. Radwan, S.; Soliman, A.; Tokarev, J.; Zhang, L.; van Kuijk, F.; Koozekanani, D. Association of disorganization of retinal inner layers with vision after resolution of center-involved diabetic macular edema. JAMA Ophthalmol. 2015, 133, 820–825. [Google Scholar] [CrossRef] [PubMed]
  7. Sun, J.; Radwan, S.H.; Soliman, A.Z.; Lammer, J.; Lin, M.M.; Prager, S.G.; Silva, P.S.; Aiello, L.B.; Aiello, L.P. Neural retinal disorganization as a robust marker of visual acuity in current and resolved diabetic macular edema. Diabetes 2015, 64, 2560–2570. [Google Scholar] [CrossRef]
  8. Acón, D.; Wu, L. Multimodal imaging in diabetic macular edema. Asia-Pac. J. Ophthalmol. 2018, 7, 22–27. [Google Scholar]
  9. Das, R.; Spence, G.; Hogg, R.; Stevenson, M.; Chakravarthy, U. Disorganization of inner retina and outer retinal morphology in diabetic macular edema. JAMA Ophthalmol. 2018, 136, 202. [Google Scholar] [CrossRef] [PubMed]
  10. Joltikov, K.; Sesi, C.A.; de Castro, V.M.; Davila, J.R.; Anand, R.; Khan, S.M.; Farbman, N.; Jackson, G.R.; Johnson, C.A.; Gardner, T.W. Disorganization of retinal inner layers (DRIL) and neuroretinal dysfunction in early diabetic retinopathy. Investig. Ophthalmol. Vis. Sci. 2018, 59, 5481. [Google Scholar] [CrossRef] [PubMed]
  11. Nakano, E.; Ota, T.; Jingami, Y.; Nakata, I.; Hayashi, H.; Yamashiro, K. Correlation between metamorphopsia and disorganization of the retinal inner layers in eyes with diabetic macular edema. Graefe’s Arch. Clin. Exp. Ophthalmol. 2019, 257, 1873–1878. [Google Scholar] [CrossRef] [PubMed]
  12. Nadri, G.; Saxena, S.; Stefanickova, J.; Ziak, P.; Benacka, J.; Gilhotra, J.S.; Kruzliak, P. Disorganization of retinal inner layers correlates with ellipsoid zone disruption and retinal nerve fiber layer thinning in diabetic retinopathy. J. Diabetes Its Complicat. 2019, 33, 550–553. [Google Scholar] [CrossRef] [PubMed]
  13. Di-Luciano, A.; Lam, W.C.; Velasque, L.; Kenstelman, E.; Torres, R.M.; Alvarado-Villacorta, R.; Nagpal, M. Disorganization of the inner retinal layers in diabetic macular edema: Systematic review. Rev. Bras. Oftalmol. 2022, 81, e0027. [Google Scholar] [CrossRef]
  14. Singh, R.; Singuri, S.; Batoki, J.; Lin, K.; Luo, S.; Hatipoglu, D.; Anand-Apte, B.; Yuan, A. Deep learning algorithm detects presence of disorganization of retinal inner layers (dril)—An early imaging biomarker in diabetic retinopathy. Transl. Vis. Sci. Technol. 2023, 12, 6. [Google Scholar] [CrossRef] [PubMed]
  15. Tripathi, A.; Kumar, P.; Tulsani, A.; Chakrapani, P.K.; Maiya, G.; Bhandary, S.V.; Mayya, V.; Pathan, S.; Achar, R.; Acharya, U.R. Fuzzy Logic-Based System for Identifying the Severity of Diabetic Macular Edema from OCT B-Scan Images Using DRIL, HRF, and Cystoids. Diagnostics 2023, 13, 2550. [Google Scholar] [CrossRef] [PubMed]
  16. Singuri, S.; Luo, S.; Hatipoglu, D.; Nowacki, A.S.; Patel, R.; Schachat, A.P.; Ehlers, J.P.; Singh, R.P.; Anand-Apte, B.; Yuan, A. Clinical utility of Spectral-Domain optical coherence tomography marker disorganization of retinal inner layers in diabetic retinopathy. Ophthalmic Surg. Lasers Imaging Retin. 2023, 54, 692–700. [Google Scholar] [CrossRef] [PubMed]
  17. Toto, L.; Romano, A.; Pavan, M.; Degl’Innocenti, D.; Olivotto, V.; Formenti, F.; Viggiano, P.; Midena, E.; Mastropasqua, R. A deep learning approach to hard exudates detection and disorganization of retinal inner layers identification on OCT images. Sci. Rep. 2024, 14, 16652. [Google Scholar] [CrossRef] [PubMed]
  18. Ruiz-Medrano, J.; Udaondo Mirete, P.; Fernández-Jiménez, M.; Asencio-Duran, M.; Fernández-Vigo, J.I.; Medina-Baena, M.; Flores-Moreno, I.; Pareja-Esteban, J.; Touhami, S.; Giocanti-Aurégan, A.; et al. Biomarkers of risk of switching to dexamethasone implant for the treatment of diabetic macular oedema in real clinical practice: A multicentric study. Br. J. Ophthalmol. 2025, 109, 1155–1160. [Google Scholar] [CrossRef] [PubMed]
  19. Tripathi, A.; Gaur, S.; Agarwal, R.; Singh, N.; Singh, A.; Parveen, S.; Singh, N.; Rima, N. Disorganization of retinal inner layers as an optical coherence tomography biomarker in diabetic retinopathy: A review. Indian J. Ophthalmol. 2025, 73, 1245–1250. [Google Scholar] [CrossRef] [PubMed]
  20. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284. [Google Scholar]
  21. Kermany, D.S.; Goldbaum, M.; Cai, W.; Valentim, C.C.; Liang, H.; Baxter, S.L.; McKeown, A.; Yang, G.; Wu, X.; Yan, F.; et al. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell 2018, 172, 1122–1131.e9. [Google Scholar] [CrossRef]
Figure 1. The left panel shows no DRIL, highlighting the central 1000 μm foveal region of the OCT image. The ganglion cell layer-inner plexiform layer-inner nuclear layer (GCL-IPL-INL) interface is marked by the white broken line, while the INL-outer plexiform layer (INL-OPL) and OPL-outer nuclear layer (OPL-ONL) interfaces are highlighted by the yellow dashed and blue solid lines, respectively. The INL-OPL and OPL-ONL boundaries cannot be differentiated in the DRIL area indicated by the white double arrows in the right panel [4].
Figure 2. BYOL-based self-supervised training pipeline for learning pathologies in a low-data regime.
Figure 3. Panel (A) shows the confusion matrix, Panel (B) the agreement distribution, Panel (C) Cohen’s κ, and Panel (D) the disagreement pattern.
Figure 4. Figure showing the gradual decay of the learning rate over 125 epochs, enabling stable convergence of spatial BYOL.
Figure 5. Curves indicating global loss vs. spatial loss during training.
Figure 6. The complete training loss curve indicates stable representation learning.
Figure 7. Finetuning Loss Curves for DRIL Classification.
Figure 8. Validation Accuracy curves for DRIL fine-tuning. A 10% validation split (66 images) was held out from the training set to monitor generalization and prevent overfitting.
Figure 9. Confusion Matrix for the proposed spatial BYOL.
Figure 10. Grad-CAM heatmaps obtained for various models: (a) VGG16 (b) EfficientNetB3 (c) MobileNetV2 (d) ResNet50.
Figure 11. Grad-CAM heatmaps obtained for various models: (a) DenseNet169 (b) InceptionResNetV2 (c) Spatial BYOL.
Table 1. Summary of related studies on DRIL classification using OCT images.
Reference | Dataset/No. Images | Accuracy | Sensitivity | Specificity | Cohen’s κ | AUC | MCC
--- | --- | --- | --- | --- | --- | --- | ---
Sun et al. [3] | 120 eyes | NR | NR | NR | NR | NR | NR
Radwan et al. [6] | 70 eyes (43 DME, 27 non-DME) | NR | NR | NR | κ = 0.88–1.00 | NR | NR
Sun et al. [7] | 80 eyes | NR | NR | NR | κ = 0.69–0.77 | NR | NR
Acón & Wu [8] | Review study | NR | NR | NR | NR | NR | NR
Das et al. [9] | 102 eyes (80 patients) | NR | NR | NR | κ = 0.6–1.0 | NR | NR
Joltikov et al. [10] | 57 diabetic, 18 controls | NR | NR | NR | r = 0.98 | NR | NR
Nakano et al. [11] | 37 eyes | NR | NR | NR | CC = 0.93–0.99 | NR | NR
Nadri et al. [12] | 104 subjects (78 diabetic) | NR | NR | NR | κ = 0.85 | NR | NR
Di-Luciano et al. [13] | 7 studies (systematic review) | NR | NR | NR | NR | NR | NR
Singuri et al. [16] | 2083 eyes (1175 patients) | NR | NR | NR | κ = 0.88 | NR | NR
Singh et al. [5] | 2392 images (417 eyes, 229 patients) | 94.36% | NR | NR | NR | 0.988 | NR
Singh et al. [14] | 5992 images (1201 eyes) | 88.3% | 82.9% | 90.0% | κ > 0.85 | NR | 0.7
Tripathi et al. [15] | 150 images | 93.3% | NR | NR | NR | NR | NR
Toto et al. [17] | 442 images | 91.1% | 91.1% | 91.1% | κ = 0.82 | 91% | NR
Ruiz-Medrano et al. [18] | 275 eyes (209 switch, 66 control) | NR | NR | NR | NR | NR | NR
NR: Not Reported in the original study.
Table 2. Input image sizes and the type of augmentations used for the CNN models.
Model | Input Size | Augmentations
--- | --- | ---
DenseNet169 | 224 × 224 | Resize; Random Horizontal Flip; Rotation (±10°); Color Jitter; ImageNet Normalization
EfficientNetB3 | 300 × 300 | Resize; Random Horizontal Flip; Rotation (±20°); Color Jitter; Random Zoom; RandomResizedCrop; ImageNet Normalization
ResNet50 | 224 × 224 | Resize; Random Horizontal Flip; Rotation (±15°); Color Jitter; RandomAffine; ImageNet Normalization
VGG16 | 224 × 224 | Resize; Random Horizontal Flip; Rotation (±10°); Color Jitter; ImageNet Normalization
MobileNetV2 | 224 × 224 | Resize; Random Horizontal Flip; Rotation (±10°); Color Jitter; ImageNet Normalization
InceptionResNetV2 | 299 × 299 | Resize; Random Horizontal Flip; Rotation (±10°); Color Jitter; Normalization (mean = 0.5, std = 0.5)
Table 3. Comparison of DRIL Classification Performance Across Different Approaches.
Author/Approach | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
--- | --- | --- | --- | ---
Pretrained VGG16 + Finetuning | 98.18 | 96.63 | 100.00 | 98.29
Pretrained EfficientNetB3 + Finetuning | 96.96 | 96.55 | 97.67 | 97.10
Pretrained MobileNetV2 + Finetuning | 97.57 | 96.59 | 98.83 | 97.70
Pretrained ResNet50 + Finetuning | 98.18 | 96.62 | 100.00 | 98.28
Pretrained DenseNet169 + Finetuning | 97.57 | 96.59 | 98.83 | 97.70
Pretrained InceptionResNetV2 + Finetuning | 96.96 | 96.55 | 97.67 | 97.10
Pretrained ViT-16 + Finetuning | 98.18 | 98.19 | 98.18 | 98.18
Pretrained ViT-32 + Finetuning | 96.36 | 96.62 | 96.36 | 96.37
MoCo Backbone + CNN Finetune Head | 98.79 | 98.79 | 98.79 | 98.79
SimCLR Backbone + CNN Finetune Head | 95.76 | 96.08 | 95.76 | 95.74
DINO Backbone + CNN Finetune Head | 96.96 | 96.55 | 97.67 | 97.10
BYOL + CNN Finetune Head | 98.79 | 98.82 | 98.79 | 98.79
Proposed Spatial BYOL + Finetune + Hybrid Spatial-Aware Loss | 99.39 | 99.40 | 99.39 | 99.39
Table 4. Model and respective sizes comparison with other SSL based approaches.
Model | Total Parameters | Convergence Epochs for DRIL Classification
--- | --- | ---
VGG-16 | 138 Million | 30
EfficientNetB3 | 12 Million | 20
ResNet50 | 25 Million | 22
DenseNet169 | 12 Million | 18
InceptionResNetV2 | 55 Million | 26
ViT-B 16 | 86 Million | 18
ViT-B 32 | 88 Million | 16
SimCLR | 27 Million | 28
MoCo | 55 Million | 29
DINO | 98 Million | 33
BYOL | 68 Million | 22
Spatial BYOL | 73 Million | 19
