Article

Explainable AI for Diabetic Retinopathy: Utilizing YOLO Model on a Novel Dataset

1 Department of Computer Engineering, College of Engineering and Petroleum, Kuwait University, Safat 13060, Kuwait
2 Kuwait Specialized Eye Center, Kuwait City 35453, Kuwait
* Author to whom correspondence should be addressed.
AI 2025, 6(12), 301; https://doi.org/10.3390/ai6120301
Submission received: 9 October 2025 / Revised: 14 November 2025 / Accepted: 20 November 2025 / Published: 24 November 2025

Abstract

Background: Automated image classification can substantially reduce diagnostic errors and enhance clinical decision-making. Methods: We implemented a YOLO (You Only Look Once)-based system to classify diabetic retinopathy (DR) using a novel retinal dataset. Although YOLO offers excellent accuracy and speed in object recognition and classification, its interpretability is limited. Both binary and multi-class (graded severity) classification approaches were employed. The Contrast-Limited Adaptive Histogram Equalization (CLAHE) technique was used to enhance image contrast and the visibility of fine details. To improve interpretability, we applied Eigen Class Activation Mapping (Eigen-CAM) to visualize the regions that influenced classification predictions. Results: Our models exhibited robust and consistent performance on both the binary and 5-class tasks. The YOLO 11l model obtained a binary classification accuracy of 97.02% and an Area Under Curve (AUC) score of 0.98. The YOLO 8x model showed the best performance in 5-class classification, with an accuracy of 80.12% and an AUC score of 0.88. A simple interface was created with Gradio to enable real-time interaction. Conclusions: The proposed technique combines strong predictive accuracy with visual interpretability, making it a promising tool for DR screening in clinical settings.

Graphical Abstract

1. Introduction

Millions of people worldwide live with diabetes, a chronic metabolic disease with long-term consequences for general health, and the eyes are especially susceptible. Diabetic retinopathy (DR) is one of the most common and serious complications of diabetes. If it is not diagnosed and treated promptly, it can cause vision loss and even blindness [1]. It is a leading cause of vision impairment worldwide, affecting around 33% of adults with diabetes. Prompt detection and treatment of DR can therefore greatly lower the risk of permanent vision loss and other visual complications. About 60% of people with type 2 diabetes and almost all people with type 1 diabetes develop retinopathy within 20 years of diagnosis. However, DR can progress without any symptoms until it compromises eyesight [2]. One in three diabetics is affected by DR or is at risk of developing it [3]. Regular screening is essential for people with diabetes because this insidious eye condition develops gradually and frequently shows no signs in its early stages. Detection of DR typically entails a fundus examination using sophisticated imaging techniques [4].
Diabetes can damage blood vessels, impairing blood supply to the retina and compromising eyesight [5]. Because of high costs and limited access to ophthalmological services, fewer than 60% of diabetic patients receive regular eye examinations. Clinical assessments or remote evaluations of color fundus photographs constitute the gold-standard screening method for DR [6]. DR progresses through stages, each causing further retinal damage. Non-proliferative diabetic retinopathy (NPDR), the initial stage, is characterized by cotton wool spots, hemorrhages, and microaneurysms. As the condition progresses, proliferative diabetic retinopathy (PDR), a more advanced stage marked by neovascularization, may develop, with potentially blinding effects [1]. Early identification of DR is essential for targeted interventions such as blood pressure management and glucose control [7]. Predicting the course of the disease requires analyzing risk factors such as blood pressure, blood glucose, lipid levels, genetic predisposition, and the duration of diabetes.
Artificial Intelligence (AI) has revolutionized automated image analysis, helping to detect lesions, support diagnosis, and generate reports, thereby reducing the labor-intensive interpretation and expertise required of ophthalmologists [8]. Various physical therapies exist for DR, including laser treatment for PDR, intravitreal injections for maculopathy, steroid eye implants, and surgical interventions [9]; these are essential means of slowing the progression of DR. Medical image analysis has been transformed by deep learning (DL) models, a subset of machine learning (ML), particularly in ophthalmology [10]. They scale effectively by learning and improving with larger datasets, making them suitable for screening programs [10]. After training on large datasets of retinal fundus images, these models can outperform humans in identifying and classifying the pathological changes of the DR stages.
These DL models are trained on sizable datasets of retinal images graded by ophthalmologists for detecting and classifying DR. The data are processed by Convolutional Neural Networks (CNNs), which learn hierarchical features from the images [11]. To reduce discrepancies between predictions and actual DR grades, the algorithms adjust their parameters after analyzing thousands or millions of photographs. Certain models also detect diabetic macular edema [12]. Such systems can provide rapid, objective assessments, reduce healthcare professionals’ workload, and aid in precise monitoring of disease progression.
YOLO (You Only Look Once) is a real-time detection framework that can quickly locate and categorize objects in images. It uses neural networks to accomplish two important computer vision tasks: object detection and classification. YOLO analyzes complete images in a single pass [13]. Its fundamental innovation lies in concurrently predicting bounding boxes and class probabilities on a grid overlaying the image. This single-shot methodology makes YOLO remarkably fast, enabling the real-time performance essential for applications such as autonomous driving and video monitoring, while preserving excellent accuracy [14]. It is a CNN-based approach that breaks with tradition and strikes a strong balance between precision and speed [15]. Researchers have embraced YOLO models for object detection and segmentation applications because of their open-source nature and adaptability. The timeline of the YOLO family is depicted in Figure 1, and its key features are summarized in Table 1.
Our study provides a detailed comparison of performance across binary and multi-class classification settings, YOLO versions 8 and 11, and model scales, using both quantitative measures and qualitative visual explanations. We use Eigen Class Activation Mapping (Eigen-CAM) to interpret the model. Models whose decisions can be explained are preferable in situations that demand high interpretability, such as supporting clinical diagnosis [28]. The Contrast-Limited Adaptive Histogram Equalization (CLAHE) technique makes an image’s contrast and details more visible; this preprocessing step is important for producing clearer retinal images, which in turn are easier to classify.
The proposed framework’s contribution is its comprehensive, clinically informed integration of robust feature processing, precise classification, and essential visual explainability, despite the extensive application of DL methods, including various YOLO iterations, in DR classification. The major contributions of this study are as follows:
  • We provide a novel, carefully curated collection of retinal fundus images created specifically for DR classification. The dataset contains well-labeled images that can be used for both binary (DR vs. Normal) and multi-class (graded severity) classification tasks.
  • We perform both binary and multi-class DR classification with five model scales (nano, small, medium, large, and extra-large) of the latest image classification architectures, YOLOv8 (YOLO 8) and YOLOv11 (YOLO 11). This comparison of designs and configurations provides a robust picture of model performance. Attaining strong performance on both classification tasks with efficient architectures supports their suitability for clinical deployment.
  • CLAHE was implemented during preprocessing to enhance the visibility of retinal features. This improved the model’s capacity to recognize subtle clinical patterns and the quality of the images.
  • To improve model transparency, we utilize Eigen-CAM to visualize and explain the decision-making process of the best YOLO models. This provides a clear view of the image regions that most strongly influenced the classification results. Eigen-CAM not only improves interpretability but also lets ophthalmologists verify the location of diseased areas, which is important for clinical use.
  • The study provides both automated categorization and visual explainability to help ophthalmologists make decisions.
  • We developed an interactive model demonstration with Gradio to provide real-time evaluation. This tool not only enables the presentation and evaluation of the model’s functioning but also establishes a foundation for the future construction of a comprehensive application.
The remainder of the paper is organized as follows: Section 2 reviews pertinent research in DR classification. Section 3 describes the proposed methodology, including dataset characteristics, model architectures (YOLOv8 and YOLOv11), and the classification process for both binary and multi-class tasks; it also details the experimental configuration and the evaluation criteria employed in the study. Section 4 presents the results, while Section 5 provides an in-depth discussion of the model’s performance, generalizability, and interpretability. Section 6 concludes the paper by summarizing the principal findings and outlining prospective research directions.

2. Literature Review

The purpose of the proposed study is to employ statistical metrics and pretrained deep neural networks to build automated methods for detecting DR early. Most retinal AI systems use single-disease models that do not reflect the complexity of real-world differential diagnosis. To overcome this constraint, recent efforts have concentrated on integrated deep learning frameworks capable of identifying various retinal diseases from a single fundus image [29].
Singh et al. [30] introduced an AI-driven system for resilient multiclass DR classification from fundus images, underpinned by preprocessing techniques such as pixel normalization and geometric correction. A hierarchical model combining a CNN, DenseNet-121, and a RefineNet-U is utilized for the classification, achieving 99% accuracy on the DDR dataset. Another study by Youldash et al. [31] tested DL models on both the publicly accessible APTOS dataset and a private clinical dataset from the Al-Saif Medical Center in Saudi Arabia. EfficientNetB3 achieved 98.2% accuracy in binary DR classification, and EfficientNetV2B1 achieved 84.4% accuracy in multi-class DR classification.
The study by Zaylaa et al. showed that ResNet-50 detects DR better than GoogleNet, with a sensitivity and accuracy of 99.44% [4]. For DR classification, Wei et al. [32] proposed a novel Multi-scale Spatial-aware Transformer Network (MSTNet). To extract local information and global context from images, the network uses a cross-fusion classifier, multiple instance learning, a dual-pathway backbone network, and a spatial-aware module. Tested on four publicly available DR datasets, MSTNet demonstrated strong diagnostic and grading performance, increasing F1 scores by 1.2% and accuracy by up to 2.0%. Minyar et al. [33] proposed a deep ConvNet architecture to analyze fundus images and distinguish between healthy, mild, and severe DR. This methodology facilitates early detection and staging, reducing reliance on retina specialists and widening access to eye care.
Sharma et al. [34] proposed an adaptive Gabor filter utilizing a chaotic map to enhance filtering performance. The system was examined on three datasets, DiaRetDB1, APTOS 2019, and EyePACS, confirming its robustness and dependability. Gradient-weighted Class Activation Mapping (Grad-CAM) facilitates proficient segmentation and classification, with the model attaining 99.01% accuracy on DiaRetDB1, 98.98% on APTOS 2019, and 99.12% on EyePACS. Their research presents a hybrid model that combines DL algorithms with image processing techniques for the early identification of DR. The model utilizes ResNet50, InceptionV3, and VGG-19 for feature extraction and classification, employing a hybrid attention-based stacking ensemble to enhance accuracy.
Herrero-Tudela et al. [35] proposed a new method for automatically grading DR with visual explanations, based on the ResNet-50 network and SHapley Additive exPlanations (SHAP). Their method achieved accuracy rates of 94.64%, 86.36%, 84.23%, 82.79%, and 85.65% on five retinal image databases, and their research identified alterations in retinal vasculature as risk indicators and peripheral retinal observations as essential for predicting the course of DR. Experiments performed on the APTOS-2019 and EyePACS datasets exhibit enhanced accuracy, precision, and recall [5]. Another system, built on the benchmark APTOS 2019 Gaussian-filtered DR image dataset with a custom CNN model, reached 98% accuracy. Posham et al. [36] used a local interpretable model-agnostic explanation (LIME) visualization model to improve clarity and interpretability.
Ohri et al. [37] investigated how transfer learning can improve the ability of AI medical models to determine DR severity. Their research indicated that supervised pre-training on ImageNet, followed by fine-tuning on labeled domain-specific fundus images, significantly improves performance when the entire training dataset is employed. However, the model exhibits diminished performance in low-data scenarios, highlighting the limitations of supervised learning with insufficient annotated data. Naz et al. [38] described a new approach for early DR screening that combines segmentation and unsupervised learning. It adds a neural-network energy-based model to the Fuzzy C-Means method, improving both accuracy and speed: the approach achieved 99.03% accuracy without noise and 93.13% with noise, with an average runtime of 16.1 s. The Modified Generative Adversarial-based Crossover Salp Grasshopper approach is a DL strategy that uses fundus imaging datasets to detect and classify diabetic retinal disease early [39]. It combines pre-processing, CNNs, and a generative adversarial network for feature extraction, improving accuracy, precision, recall, specificity, and the F1 metric. Manohar et al. [10] also built a novel CNN model to detect DR in retinal images; based on accuracy, precision, recall, and F1-score, it outperforms VGG19 and ResNet101.
A few studies have investigated enhancing classification performance by combining different models. One approach trained DL algorithms on a six-class diabetic retinopathy (DDR) dataset containing the original images, four differently preprocessed image sets, and images of ungradable quality. The study indicated that ensemble models incorporating this data variance were more accurate overall and had a higher Cohen’s Kappa than individual DL architectures [3]. Bosale et al. [40] conducted a study that used data preprocessing and image enhancement methods to detect and categorize DR.
Bhimavarapu et al. [41] introduced a new variant of grasshopper optimization that performs well even on small lesions. Their study used an improved Naïve Bayes classifier to sort the different grades of DR, reporting 99.98% accuracy on the APTOS dataset and outperforming existing methods. Nahiduzzaman et al. [42] created a parallel CNN for feature extraction and an extreme learning machine for DR classification. For early diagnosis from optical coherence tomography images, a DL model based on Swin Transformer version 2 has been recommended [43]; the model uses self-attention, the PolyLoss function, and heat maps to make its predictions more accurate. It was 99.9% accurate, showing that it can automatically classify several fundus diseases.
Acosta et al. [44] conducted a comparative analysis of two advanced CNN architectures, ResNeXt and RegNet, for automated DR classification on the APTOS dataset. They employed SHAP-based explainability to highlight anatomically and pathologically important retinal areas. Another study, by Touati et al. [45], employed the APTOS dataset for DR classification with a Transformer model, achieving 95% validation accuracy and demonstrating its capacity to improve automated diagnosis and facilitate clinical analysis.

YOLO Models

Rahman and Chandra [46] developed automated detection of DR and age-related macular degeneration (AMD) based on a multi-model method capable of highly accurate results. For detecting and segmenting lesions, it uses recent object detection models such as YOLOv8, YOLOv7, and YOLOv5, and combines CNN, Support Vector Machine (SVM), and Random Forest (RF) classifiers for the classification tasks. In another study, Geetha et al. [47] employed a YOLOv8-based DL approach for multi-class detection and classification of DR from fundus images. It performed acceptably, with a mean average precision of 0.5 and an F1 score of 0.31 across all classes. They compared YOLOv8 against CNN, SVM, VGG16, and ResNet50 models. Zero padding, max pooling, and object labeling formed part of the preprocessing.
Rizzieri et al. [48] used 100 DR images from the MESSIDOR collection to examine how well the YOLOv8 and YOLOv9 architectures could segment DR fundus features. Their methods achieved an adequate mean average precision for detecting DR lesions and a posterior pole signature. Another study, by Zhang et al. [49], used an automated method based on the YOLO model to find microaneurysms, an important sign of DR, in fundus fluorescein angiography (FFA) images. The YOLO model detected microaneurysms better than other models, with an optimal F1 score of 92.85%.
Most of the earlier research using YOLO models targeted detection tasks such as locating lesions and other objects in images. Wahab et al. [50] employed YOLOv7 as a feature extractor for detecting DR and used the MobileNetV3 pretrained model to classify DR on the APTOS and EyePACS datasets, obtaining F1 scores of 93.7% for APTOS and 93.1% for EyePACS. However, problems remain, such as the need for expert confirmation and the difficulty of finding tiny lesions. Similar research on the Kaggle dataset used an RF classifier with YOLO for feature extraction and achieved 99% accuracy [51].
Santos et al. [52] utilized the YOLOv5 model to find fundus lesions in two public datasets with four labels. The F1 score was 0.2521, indicating that the proposed method has potential but needs further work and testing to improve its accuracy and applicability. Mahapadi et al. [53] proposed a real-time DR detection framework that uses YOLOv10 with nature-inspired optimization methods. Their research showed that combining optimization techniques can greatly enhance feature selection and detection accuracy, making real-time classification possible. This approach addresses the need for fast and reliable DR screening systems, especially in clinical settings with limited resources.
The YOLO object detection framework is becoming more popular in medical image analysis and is also used extensively in other fields because it is fast, accurate, and works in real time. Liao et al. [54] proposed YOLO-MECD, a YOLOv11-based model for detecting citrus fruits with high accuracy in agriculture; the model handles problems such as occlusion and changing lighting conditions. Similarly, Mao and Hong [14] provided a thorough analysis of YOLO-based methods from v1 to v11 for real-time defect detection in textiles, focusing on the model’s adaptability and reliability in real-world settings. These non-medical applications show that YOLO models can solve difficult visual problems across a wide range of situations. Their success in many disciplines demonstrates the strength of the framework and motivates its use in this study for detecting DR. The background studies are summarized in Table 2.

3. Materials and Methods

The proposed framework for DR diagnosis and grading is depicted in Figure 2. The procedure commences with a raw retinal image, which first undergoes critical pre-processing steps: resizing to a consistent dimension and applying the CLAHE method, both of which are essential for good neural network performance. The pre-processed dataset is systematically partitioned into training data for learning model patterns, validation data for parameter optimization and overfitting prevention during training, and testing data for an impartial final assessment of the model’s performance on novel instances. The system is fundamentally based on the YOLO model, particularly versions 8 and 11, which offer several model scales, nano (n), small (s), medium (m), large (l), and extra-large (x), to balance computational efficiency and performance. A pivotal element is fine-tuning, wherein YOLO weights pre-trained on ImageNet are used to initialize training on our dataset. The model’s predictions result in classification, which can be executed in two distinct ways: a binary outcome denoting DR or Normal, and a more detailed multi-class classification indicating severity levels (Normal, Mild, Moderate, Severe, PDR).
The classification system’s efficacy is quantitatively assessed using various performance metrics, such as Accuracy, Precision, Recall, F1-Score, MCC (Matthews Correlation Coefficient), and Area Under the Curve (AUC), offering a thorough evaluation of the model’s prediction capability. To enable interaction with the trained model and demonstrate its functionalities, we created an interactive web-based demonstration with Gradio [56].

3.1. Dataset

The collection of retinal images from patients at the Kuwait Specialized Eye Center constitutes an important milestone in our research. Incorporating this localized dataset into the study ensures greater specificity and relevance to the Kuwaiti population. Ethical approval for collecting the images was granted by the Office of the Vice Dean for Academic Affairs, Research and Graduate Studies, College of Engineering and Petroleum, Kuwait University (reference number: 24/2/757, 15 October 2023). The procedure and the consent form were also approved by the Kuwait Specialized Eye Center committee.
In the initial stage, around 1096 images were collected from the Kuwait Specialized Eye Center. Patients were selected from the retina clinic, specifically those with a history of diabetes mellitus, and a detailed ophthalmic history was recorded in the medical file. These patients undergo a dilated fundus examination and fundus photography as part of the routine procedure. Before this, informed consent is obtained from each patient to use their anonymized fundus images for research purposes.
To ensure optimal examination conditions, patients receive pharmacological dilation of the pupil using Tropicamide 0.5% eye drops. Dilation involves instilling one or two drops of the 0.5% solution into each eye 15 to 20 min before the examination. Fundus images are then captured using the CLARUS 500 fundus camera from Zeiss. The parameters of the imaging modes are as follows:
  • Color: RGB;
  • Field of view: 133°;
  • Optical resolution: 7.3 µm;
  • Minimum Pupil Diameter: 2.5 mm;
  • Image Format: PNG;
  • Width: Varying (1366 pixels and 1920 pixels);
  • Height: Varying (991 pixels and 679 pixels).

3.1.1. Cleaning the Dataset

Any images with hazy media caused by corneal opacity or cataracts were excluded to maintain the integrity of the study data. The collected images were classified into five stages of DR under the supervision of ophthalmologists: normal, non-proliferative diabetic retinopathy (NpDR; mild, moderate, or severe), and proliferative diabetic retinopathy (PDR). The images were then re-examined by the ophthalmologists.
The International Clinical Diabetic Retinopathy (ICDR) scale was used to grade all the images. Two ophthalmologists assessed the images independently, and a senior retinal expert settled any discrepancies. The dataset used to train the model contains only the final adjudicated labels, ensuring a consistent ground truth across the dataset.
Finally, the total number of retinal images is 806, with 319 in the normal folder, 69 in mild NpDR, 91 in moderate NpDR, 69 in severe NpDR, and 258 in PDR. Table 3 shows detailed information on the dataset size. For binary classification, all other folders except normal are merged and named as DR; thus, we have two classes for the prediction (normal and DR).

3.1.2. Sample Image

Figure 3 depicts random raw images belonging to the five DR stages from the dataset. Figure 4 shows the RGB channel of the original image we took from the Mild NpDR category of DR (a random image). We examined the appearance of distinct color channels.

3.1.3. Image Quality

The quality of the images is checked using no-reference image quality metrics. Three methods are used: the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE), the Naturalness Image Quality Evaluator (NIQE), and the Perception-based Image Quality Evaluator (PIQE) [57]. For all three methods, the lower the value, the better the quality [58]. These three models, BRISQUE, NIQE, and PIQE, represent major developments in image quality evaluation. They are based on the idea that certain statistical characteristics of high-quality images are markedly changed by distortions; these changes can be measured and used to evaluate image quality in a way that aligns well with human visual perception.
We evaluated the three metrics on random samples from our dataset and observed that the scores improve as image quality increases (as depicted in Figure 5): the higher the image quality, the lower the metric score.
We calculated the BRISQUE, NIQE, and PIQE scores for a random image in the local dataset and for its deformed variants using the default models in MATLAB (version 24.1). For this, we created artificially noisy and blurred versions of the original image (as shown in Figure 6) and computed the corresponding scores, which are listed in Table 4. All scores are lowest for the original image; hence, the undistorted original has the highest perceived quality.

3.2. Splitting, Preprocessing, and Augmenting

The images are split into three parts: training, validation, and testing, using 80% of the data for training, 10% for testing, and 10% for validation. We separated the dataset at the patient level to keep the subsets disjoint and prevent data leakage: every image from the same patient was grouped and assigned to exactly one subset (training, validation, or testing), so no patient contributed data to more than one partition. Table 5 shows the training, validation, and testing data sizes. For binary classification, the mild, moderate, severe, and PDR classes are combined; hence, the training count for the DR class is 407, with 36 for validation and 44 for testing, while the normal class has 237 for training, 45 for validation, and 37 for testing. The labels are named class 0 (DR) and class 1 (Normal) for binary classification. For the multi-class setting, the labels are class 0 (Mild), class 1 (Moderate), class 2 (Normal), class 3 (PDR), and class 4 (Severe).
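A minimal sketch of such a patient-level split is given below. It assumes a hypothetical metadata table mapping each image to a patient ID and uses scikit-learn's GroupShuffleSplit, which is one possible way to realize the grouping described above (filenames, IDs, and labels are placeholders, not the study's actual data).

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical metadata: one row per image, with its patient ID and DR grade.
df = pd.DataFrame({
    "filename":   [f"img_{i:03d}.png" for i in range(12)],
    "patient_id": [1, 1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10],
    "label":      ["Normal", "Mild", "PDR", "Normal", "Moderate", "Severe",
                   "Normal", "PDR", "Mild", "Moderate", "Severe", "Normal"],
})

# 80% of patients go to training; images from one patient never straddle subsets.
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, hold_idx = next(gss.split(df, groups=df["patient_id"]))
train_df, hold_df = df.iloc[train_idx], df.iloc[hold_idx]

# The held-out patients are split evenly into validation and test.
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=42)
val_idx, test_idx = next(gss2.split(hold_df, groups=hold_df["patient_id"]))
val_df, test_df = hold_df.iloc[val_idx], hold_df.iloc[test_idx]

# Sanity check: no patient appears in more than one subset.
assert set(train_df["patient_id"]).isdisjoint(hold_df["patient_id"])
assert set(val_df["patient_id"]).isdisjoint(test_df["patient_id"])
```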
Images must pass through a sequence of preprocessing steps to improve their quality and consistency before being fed to ML or DL models. CLAHE is used to improve image quality, particularly under low contrast or inadequate illumination [59]. The method prevents over-enhancement, guaranteeing smooth transitions between regions while improving contrast in localized areas of the image. To guarantee uniformity throughout the dataset, every image is finally resized to a consistent dimension. Normalization is handled by YOLO itself during model training as part of its preprocessing flow: the model automatically converts images to RGB, scales pixel values to the range zero to one, and applies predetermined mean and standard deviation values for normalization [60]. Together, these preprocessing techniques prepare the images for more accurate and efficient analysis by the models. Figure 7 shows the original and CLAHE-preprocessed images from each of the five classes.
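The CLAHE and resizing steps can be sketched with OpenCV as follows; the clip limit, tile size, and the choice to enhance only the LAB luminance channel are illustrative assumptions, since the exact settings are not specified here.

```python
import cv2

def preprocess(path: str, size: int = 640):
    """Apply CLAHE to the luminance channel and resize to a square input."""
    img = cv2.imread(path)                              # BGR fundus image
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)          # enhance luminance only
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # assumed settings
    lab = cv2.merge((clahe.apply(l), a, b))
    enhanced = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
    return cv2.resize(enhanced, (size, size))           # uniform input dimension
```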
A comprehensive data augmentation pipeline was utilized during training to improve the model’s generalization. RandAugment was applied to randomly compose transformations [61]. Photometric modifications were executed by altering hue (0.015), saturation (0.7), and brightness (0.4). Geometric alterations incorporated a 50% probability of horizontal flipping, while vertical flipping was disabled. We further applied random translations of up to 10% and random scaling of up to 50% of the original dimensions. Random Erasing was employed with a probability of 0.4 to mimic occlusions and enhance the model’s robustness. All augmentations were applied only during the training phase, whereas validation was conducted on non-augmented images to guarantee an impartial assessment of performance. Figure 8 shows some randomly augmented images from the training phase.

3.3. YOLO Models

The YOLO architecture consists of three main components, the backbone, neck, and head, each of which undergoes substantial changes between versions to improve performance [62]. The backbone, usually a CNN pre-trained on large datasets such as ImageNet for classification tasks, is responsible for extracting features from the input. ResNet50, ResNet-101, and CSPDarkNet-53 are popular backbones in YOLO versions; each provides different depth and feature-extraction capability to balance speed and accuracy. Figure 9 lists the fundamental YOLO architectural elements.
Joseph Redmon et al. released YOLOv1, the initial version of the YOLO object detection algorithm, in 2015 [16]. YOLO offers high detection accuracy with minimal background errors and good generalization to new domains, making it well suited to applications such as image classification [62]. YOLO operates by partitioning an image into grid cells, with each cell predicting the object class. The design resizes the input image, employs 1 × 1 and 3 × 3 convolutions to reduce the number of channels, and incorporates batch normalization and dropout to regularize the model and mitigate overfitting. It is a prevalent algorithm utilized across multiple domains, including healthcare, agriculture, security monitoring, and autonomous vehicles [63].
YOLOv8 improves on the performance of its predecessors, particularly YOLOv5, offering outstanding speed and precision. Many training strategies and improvements, including advanced data augmentation techniques, an effective training workflow, and adaptability, help strengthen YOLOv8 [64]. The framework is designed to support additional computer vision tasks such as image classification and instance segmentation, making it a flexible tool for several uses [65].
Figure 10 shows the YOLOv8 model’s architectural composition. CSPDarkNet is a backbone architecture designed particularly for YOLOv8 [64,66]. Every convolution block features SiLU activation and batch normalization. A spatial pyramid pooling fast (SPPF) layer pools features into a fixed-size map, thereby accelerating computation. Though it offers various changes that increase accuracy and efficiency for object detection, its foundation is the DarkNet architecture. The application of a cross-stage partial dense block (CSPDarkNet) marks one of its main developments. A CSP block, also known as a C2f block, splits the base layer’s feature map in half: one section is routed around via a skip connection, the other passes through a dense block, and the output of the C2f block is formed by concatenating the two. Anchor-free detection, a decoupled head, an efficient design, user-friendliness, and support for several tasks are the defining characteristics of YOLOv8.
Designed for outstanding object detection, classification, and numerous other visual tasks, YOLO11 is a recent model in the YOLO series [67]. It combines improvements in network architecture that enhance feature extraction and processing, yielding fewer parameters and higher accuracy. YOLO11 benefits from the strong Ultralytics ecosystem and is fit for use on several platforms [26]. Strong points of the model include increased accuracy, efficient inference, task flexibility, scalability, and user friendliness. However, it can require substantial processing resources and has fewer third-party integrations than YOLOv8.
Unlike YOLOv8, YOLO11’s design uses C3K2 blocks to manage feature extraction at several backbone stages rather than the C2f block. By segmenting the feature map and using a sequence of smaller kernel convolutions, the C3K2 block maximizes information flow across the network. YOLO11 keeps the SPPF module, used to pool features from several areas of an image at various rates. The C2PSA block (Cross Stage Partial with Spatial Attention) added in the neck is one of the major changes in YOLO11. By stressing spatial significance in the feature maps, this block introduces attention mechanisms that sharpen the model’s focus on significant areas within an image, including smaller or partially obscured objects. The overall architecture is otherwise similar to YOLOv8; Figure 11 shows the new blocks of the YOLO11 model.
YOLO offers several classification model scales, including nano (n), small (s), medium (m), large (l), and extra-large (x), each tailored to balance model size, inference speed, and accuracy. The choice among the YOLOv8 and YOLO11 variants ultimately depends on the specific needs of an application, weighing the importance of speed against accuracy and the available hardware. Table 6 lists the architectural variants and parameter counts of every model used in the proposed research.

3.4. Model Parameters

The YOLO classification model was refined utilizing an extensive array of parameters and data augmentation techniques to enhance performance and generalization. The training was performed for 60 epochs using a batch size of 16 and an input picture resolution of 640 × 640 pixels. The models employed pretrained weights and a dropout rate of 0.2.
The selection of the optimizer significantly influences the convergence rate and stability of DL models [68]. In this study, we utilized the automated configuration feature of the YOLO framework, which autonomously selects a suitable optimizer according to the model parameters and data requirements. This method guarantees a balanced compromise between convergence stability and performance, eliminating the need for manual tuning of optimizer settings. Stochastic Gradient Descent (SGD), Adaptive Moment Estimation (Adam), Adam with Weight Decoupling (AdamW), Nesterov-accelerated Adaptive Moment Estimation (NAdam), Rectified Adam (RAdam), and Root Mean Square Propagation (RMSProp) are the optimizers available in the YOLO framework. All models used the AdamW optimizer through this automatic selection.
Early stopping was applied with a patience of 15 epochs to guard against overfitting, terminating training if validation performance did not improve within that interval. We also used a cosine learning rate scheduler to decay the learning rate over the epochs, which aids convergence. All parameters of the training configuration are listed in Table 7.
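For illustration, the following minimal sketch shows how the reported configuration (60 epochs, batch 16, 640 × 640 inputs, dropout 0.2, automatic optimizer selection, cosine schedule, patience 15) and the augmentation settings from Section 3.2 map onto the Ultralytics training API; the dataset path and model file are assumptions, not the authors' exact script.

```python
from ultralytics import YOLO

model = YOLO("yolo11l-cls.pt")        # pretrained classification weights
model.train(
    data="dr_dataset",                # hypothetical folder with train/val/test splits
    epochs=60, batch=16, imgsz=640,
    dropout=0.2,
    optimizer="auto",                 # framework selected AdamW automatically here
    cos_lr=True,                      # cosine learning-rate schedule
    patience=15,                      # early stopping patience
    auto_augment="randaugment",
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,
    fliplr=0.5, flipud=0.0,
    translate=0.1, scale=0.5,
    erasing=0.4,
)
```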
YOLOv8 employs a cross-entropy (CE) loss function, while the YOLO11 model utilizes a focal-style CE. It evaluates the efficacy of a classification model that produces a probability output ranging from 0 to 1. Binary cross-entropy is employed for binary classification tasks, whilst categorical cross-entropy is utilized for multi-class classification tasks. The CE loss function is defined as
$$CE = -\sum_{n=1}^{N} y_n \cdot \log(p_n)$$
where $N$ is the number of classes, $y_n$ is the true label indicator of the $n$th class, and $p_n$ is the predicted probability of the $n$th class.
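For intuition, the following minimal check shows how the categorical form of this loss corresponds to a standard framework call (the numbers are illustrative only).

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0, 0.1, 0.3]])   # one image, five-class scores (illustrative)
target = torch.tensor([0])                             # index of the true class
ce = F.cross_entropy(logits, target)                   # equals -log(softmax(logits)[0, 0])
print(float(ce))
```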

3.5. Performance Metrics

This study assessed the efficacy of the DR classification model utilizing various standard metrics: accuracy, precision, recall, F1-score, and MCC. These metrics offer a thorough insight into the model’s capacity to accurately discern various phases of DR. These measurements are especially pertinent for DR classification, where the ramifications of misclassifying a severe disease as moderate (false negative) can be clinically substantial. Consequently, employing a varied array of evaluation measures guarantees a more dependable and clinically significant appraisal of model efficacy.
The confusion matrix and classification report are examined for each model. This study evaluates metrics such as accuracy, F1-score, recall, MCC, and precision. Each metric’s formulas are given below:
$$\mathrm{Precision} = \frac{TrPos}{TrPos + FaPos}$$
$$\mathrm{Recall/Sensitivity} = \frac{TrPos}{TrPos + FaNeg}$$
$$\mathrm{Accuracy} = \frac{TrPos + TrNeg}{TrPos + TrNeg + FaPos + FaNeg}$$
$$\mathrm{F1\text{-}score} = \frac{2\,TrPos}{2\,TrPos + FaPos + FaNeg}$$
$$\mathrm{MCC} = \frac{TrPos \cdot TrNeg - FaPos \cdot FaNeg}{\sqrt{(TrPos + FaPos)(TrPos + FaNeg)(TrNeg + FaPos)(TrNeg + FaNeg)}}$$
where TrPos, TrNeg, FaPos, and FaNeg denote true positives, true negatives, false positives, and false negatives, respectively.
Accuracy is the overall correctness of the model in classifying DR. Precision denotes the proportion of correctly predicted instances within each category, illustrating, for example, how many images classified as severe or PDR were genuinely true cases. Recall (sensitivity) assesses the model’s proficiency in correctly identifying true instances within each class, which is especially vital for recognizing moderate to advanced stages of DR. The F1-score, being the harmonic mean of precision and recall, offers a balanced performance metric, particularly useful in cases with unequal class distributions. The MCC provides a thorough assessment by condensing the confusion matrix into a single statistic, even in the presence of class imbalance. A high MCC value across all five DR classes underscores the model’s dependability and consistency, affirming its appropriateness for clinical decision-making in the early diagnosis, categorization, and monitoring of DR.
The area under the receiver operating characteristic curve (AUC-ROC) is an extensively employed metric for analyzing the performance of classification techniques, especially in unbalanced medical datasets where accuracy may be deceptive. An AUC of 1.0 signifies flawless classification, whereas a score of 0.5 denotes performance akin to random guessing. AUC provides an extensive assessment of model performance that transcends mere overall accuracy, particularly in healthcare contexts where false negatives might result in severe consequences [69].
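As an illustration, the metrics above can be computed with scikit-learn as in the following sketch; the label and score arrays, the weighted averaging, and the one-vs-rest AUC setting are assumptions for the multi-class case rather than the study's exact evaluation code.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

# Placeholder predictions for five test images over the five DR classes.
y_true  = np.array([0, 1, 2, 3, 4])
y_pred  = np.array([0, 1, 2, 3, 3])
y_score = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.05, 0.10, 0.70, 0.10, 0.05],
    [0.05, 0.05, 0.10, 0.60, 0.20],
    [0.05, 0.05, 0.10, 0.45, 0.35],
])

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
    "recall":    recall_score(y_true, y_pred, average="weighted", zero_division=0),
    "f1":        f1_score(y_true, y_pred, average="weighted", zero_division=0),
    "mcc":       matthews_corrcoef(y_true, y_pred),
    # One-vs-rest, macro-averaged AUC for the multi-class setting.
    "auc":       roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"),
}
print(metrics)
```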

3.6. Model Explainability and Demonstration

Eigen-CAM emphasizes areas in the image that are most pertinent to a certain class prediction [70]. We modified the Eigen-CAM approach to function with the YOLO architecture, which initially produces model outputs as tuples. To guarantee compatibility with the Eigen-CAM framework, we developed a streamlined wrapper over the model to extract and return just the classification logits. This enabled the explainability approach to produce heatmaps on input images, emphasizing the areas that significantly impacted the model’s predictions.
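The wrapper idea can be sketched as follows with the pytorch-grad-cam package; the wrapper class, the chosen target layer, and the variable names are illustrative assumptions rather than the authors' exact implementation.

```python
import torch.nn as nn
from ultralytics import YOLO
from pytorch_grad_cam import EigenCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

class ClassifierWrapper(nn.Module):
    """Return only the classification logits; the raw YOLO forward pass can
    yield a tuple that Eigen-CAM cannot consume directly."""
    def __init__(self, torch_model):
        super().__init__()
        self.net = torch_model

    def forward(self, x):
        out = self.net(x)
        return out[0] if isinstance(out, (tuple, list)) else out

yolo = YOLO("best.pt")                          # trained classification weights (assumed file)
wrapped = ClassifierWrapper(yolo.model).eval()
target_layers = [yolo.model.model[-2]]          # assumed last convolutional block
cam = EigenCAM(model=wrapped, target_layers=target_layers)
heatmap = cam(input_tensor=img_tensor)[0]       # img_tensor: 1x3x640x640 float in [0, 1]
overlay = show_cam_on_image(rgb_float_img, heatmap, use_rgb=True)  # rgb_float_img in [0, 1]
```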
With Gradio, we created a web-based interface to enable real-time interaction with the model [56]. Gradio is an open-source Python library that exposes ML/DL models through a user interface. The user-friendly display makes the model easy to demonstrate and assess for both developers and non-developers, and web-based interfaces can be created that let users upload an image. The Gradio interface simplifies deploying a DL model, allowing for performance demonstration, feedback gathering, testing, and sharing via generated links. In our case the tool simply classifies uploaded photographs, removing the need for sophisticated frontend programming. Gradio’s ease of use makes it ideal for prototyping and interpreting ML and DL tasks [71,72].
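A minimal sketch of such a Gradio demonstration, assuming a trained Ultralytics classification checkpoint (the file name, title, and label handling are illustrative), is shown below.

```python
import gradio as gr
from ultralytics import YOLO

model = YOLO("best.pt")                       # trained classification weights (assumed file)

def classify(image):
    """Run the classifier on an uploaded image and return class probabilities."""
    result = model(image)[0]                  # Ultralytics classification result
    probs = result.probs.data.tolist()
    return {result.names[i]: float(p) for i, p in enumerate(probs)}

demo = gr.Interface(fn=classify,
                    inputs=gr.Image(type="pil"),
                    outputs=gr.Label(num_top_classes=2),
                    title="Diabetic Retinopathy Screening Demo")
demo.launch(share=True)                       # generates a shareable link
```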

4. Results

All experiments were performed on a workstation with an NVIDIA GeForce RTX 4090 with 24 GB of VRAM. The software environment used Windows 11 with 128 GB of RAM, Python 3.10, and Ultralytics 8.3.146.

4.1. Binary Classification

Table 8 describes the early stopping epoch, the final learning rate, training time, and inference latency for the binary classification. Inference latency is a vital statistic in AI and ML, especially when deploying models in practical applications. It denotes the delay between presenting an image to a trained model and the model generating a prediction, quantifying the speed at which the model can analyze fresh data and deliver a conclusion. Reducing inference latency is typically essential for applications that demand prompt responses, directly affecting the usability and efficacy of AI systems.
Table 9 shows the training and test results for each model. The YOLO 11l model had a test accuracy of 97.02%, which means it performed well in sorting fundus images into the right classes. The precision is 97.02%, which means that the model makes very few false-positive predictions. A recall rate of 97.01% shows that the test is very good at finding genuine positives, which is very important for making sure that doctors do not miss any diagnoses. The F1-score is 97.02%, which shows that there was a good balance between accuracy and recall. The MCC is 0.9404, which shows that the model works well even when there is an imbalance in the classes.
The YOLO 11l model’s training and validation loss plot is depicted in Figure 12. The training loss demonstrated a consistent decline, and the validation accuracy showed clear evidence of consistent learning. The curves demonstrate smooth convergence with a small, stable gap between the training and validation measures, indicating that the model generalizes well. The final reported metrics were computed using the best model weights, selected by the early stopping procedure.
The confusion matrix for the best model is shown in Figure 13. The quadratic weighted kappa (QWK) score is 0.95. This shows that the model is very close to the expert labels. The model made very few under-grades and over-grades (3% and 4%). This means that the model is clinically preferable. The AUC score of each class is represented in Figure 14.
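The agreement statistics reported here and in Section 4.2 can be reproduced with scikit-learn, as in this minimal sketch (the grade arrays are placeholders for the adjudicated and predicted labels of the test split, not the study's actual values).

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Placeholder ordinal DR grades (0 = Mild ... 4 = Severe in the multi-class setting).
y_true = [0, 1, 2, 3, 4, 2, 3]
y_pred = [0, 1, 2, 2, 4, 2, 4]

qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")  # quadratic weighted kappa
cm = confusion_matrix(y_true, y_pred)                          # counts behind the confusion matrix figures
print(qwk, cm, sep="\n")
```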

4.2. Multi-Class Classification

Table 10 describes the EarlyStopping epoch, the final learning rate, training time, and inference latency for the five-class classification.
Table 11 shows the training and test results for each model that can classify five classes. The YOLO 8x model is 80.12% accurate, indicating high overall correctness in classifying fundus images into appropriate DR stages. The other metrics indicate a precision score of 85.14%, a recall rate of 80.12%, an F1-score of 81.43%, and an MCC score of 0.7288.
The YOLO 8x model’s training and validation loss plot is depicted in Figure 15. The validation accuracy showed clear evidence of steady learning, and the curves converge with a modest, persistent gap between training and validation measurements, indicating that the model generalizes well.
The confusion matrix for the best model is shown in Figure 16. The model attained a QWK of 0.707. The patterns of over- and under-grading were assessed for clinical significance, and most errors occurred between adjacent classes (e.g., severe vs. moderate DR), which carry a lower clinical risk. Clinically important mistakes, such as under-grading severe cases, were rare, indicating that sight-threatening disease was not missed to an extent that would endanger patient safety. The AUC score of each class is represented in Figure 17.

4.3. Comparison with Other Pretrained Models

Efficient image classification is the primary objective of the YOLO models (v8 and v11) used here. For classification, the model’s goal is to assign the whole input image a single label rather than locating and categorizing each object separately. The YOLO architectures are built for fast, near-real-time processing, which makes them quicker than many classical classification models.
In addition to the YOLO models, the study also evaluated several other pretrained models with CLAHE and a 640 × 640 image size: ConvNeXt (small, base, and tiny), EfficientNet (V2S and B0), and Vision Transformer (ViT). Among these, ConvNeXtBase achieved 62% accuracy for binary classification and EfficientNetV2S obtained 69% accuracy for 5-class classification. These results (Table 12) fall well short of the YOLO models’ performance.

5. Discussion

In this study, a YOLO-based deep learning method is employed for classifying DR severity using a unique, private collection of high-resolution retinal fundus images. In contrast to established datasets such as EyePACS or Messidor, the dataset used here is freshly curated and has not been previously examined. Although per-grader annotations were not preserved, the structured adjudication procedure guarantees dependable and clinically acceptable labeling for model development. This is, to our knowledge, the first study to apply the YOLO architecture to binary and five-class classification of DR on this dataset.
The behavior of the validation loss provides valuable information about the training process. In particular, the stability of the validation accuracy curve shows that feature learning was strong and reliable. Together, these results indicate that the final model weights generalize well to the classification problem.
We performed a comparative assessment of the YOLO model variants using several metrics, including accuracy, precision, recall, F1-score, and MCC, as shown in Table 13. The Friedman test revealed a statistically significant difference in performance across the models. The mean ranking across these metrics identified YOLO11l as the best model for binary classification, with an average rank of 2.33, and YOLO8x as the best model for 5-class classification, with an average rank of 1.
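A hedged sketch of how such a Friedman test can be run with SciPy is shown below; the per-metric score rows are placeholders rather than the study's actual values, and the real analysis would include all compared model variants.

```python
from scipy.stats import friedmanchisquare

# Placeholder per-metric scores (accuracy, precision, recall, F1, MCC) for three
# of the compared models; the real analysis would use the values in Table 13.
scores_yolo11l = [0.970, 0.970, 0.970, 0.970, 0.940]
scores_yolo11x = [0.964, 0.965, 0.964, 0.964, 0.929]
scores_yolo8x  = [0.958, 0.959, 0.958, 0.958, 0.917]

stat, p_value = friedmanchisquare(scores_yolo11l, scores_yolo11x, scores_yolo8x)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
```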
The suggested model, YOLO 11l, attained a macro-average AUC score of 0.98 in binary classification, as seen in Figure 18a, differentiating between normal and DR instances. In the five-class classification challenge, encompassing Normal, Mild, Moderate, Severe, and PDR, the YOLO 8x model attained a macro-average AUC of 0.88, as seen in Figure 18b. The results indicate that the models can extract significant characteristics from retinal pictures that generalize across both binary and multi-class DR grading. Elevated AUC values indicate robust discriminative efficacy across various classification thresholds, rendering them especially advantageous in clinical contexts where sensitivity and specificity are paramount.
YOLO’s real-time processing capabilities facilitate the swift categorization of retinal fundus images, which is crucial for screening extensive patient populations or incorporating into clinical workflows that need prompt decision-making. The model proficiently detects significant retinal anomalies, resulting in high AUC scores and dependable severity classification in both binary and multi-class settings. The extra-large (x) variants of YOLO (versions 8 and 11) reach about 79% accuracy on the ImageNet dataset, higher than the other scales (n, s, m, and l) [26,66]. The proposed study attained an accuracy of 97.02% for binary classification with YOLO 11l and 80.12% for the 5-class classification task with the YOLO 8x model.
Most studies based on the YOLO architecture have used the model for segmentation or localization. Table 14 compares YOLO-based approaches to the DR classification task.
Heatmaps over the input images are produced using the Eigen-CAM explainability approach, which highlights the regions that have the most influence on the model’s classification results. After preprocessing the input images, activation maps were taken out of the model’s last convolutional block. The resultant heatmaps were superimposed on the original pictures to illustrate the most prominent spots for the DR stages (as seen in Figure 19).
Alongside internal validation, we implemented the YOLO11s model with Gradio to provide an interactive web-based demonstration. The interface enables users to submit photos and observe the anticipated classifications. We utilized random pictures from the EyePACS dataset [73] to evaluate the model (Figure 20). The effective incorporation of Gradio into our pipeline establishes a basis for future application development, wherein we want to transform this prototype into a comprehensive, standalone application with augmented functionality, user administration, and interaction with clinical databases.

5.1. Ablation Study

To assess the efficacy of the preprocessing and input configuration employed in our final model, we performed an ablation study by eliminating CLAHE and decreasing the input resolution. First, we preserved the 640 × 640 resolution but omitted CLAHE; second, we further reduced the input resolution to 240 × 240. The binary classification results show that removing CLAHE led to a drop in accuracy from 97.02% to 93.98%, demonstrating its contribution to enhancing local contrast and model discriminability. Limiting the resolution to 240 × 240 reduced performance further, to 90.36%, likely owing to the loss of fine detail required for classification. These results validate the selection of CLAHE and high-resolution inputs in our final design. Similarly, in the 5-class classification, removing CLAHE reduced accuracy from 80.12% to 77.11%, and limiting the resolution to 240 × 240 reduced it to 76.51%. Table 15 displays the results of the ablation study for the best approach.

5.2. Limitation and Future Work

The proposed study employed a private dataset, constraining generalizability outside the specified data domain. Gradio has been evaluated with a public dataset for demonstration purposes; however, it has not been verified in practical applications. Although image-based categorization demonstrates robust outcomes, the incorporation of clinical data, such as patient age and HbA1c levels, may further enhance model precision.
Expanding the dataset to incorporate multi-center data will be the main goal of future work to facilitate wider validation and lessen any bias. Additionally, we intend to expand the Gradio-based demonstration to a complete application with more features and an intuitive user interface that can be used in a wider setting.

6. Conclusions

Several YOLOv8 and YOLO11 model variants were developed and assessed in this study to diagnose and grade DR. The YOLO11l and YOLO8x models exhibited the best accuracy (97.02% for binary and 80.12% for 5-class classification) and resilience across critical measures in a thorough performance evaluation on a private dataset, corroborated by statistical validation via the Friedman test. The capacity of these models to accurately detect and categorize DR stages underscores their potential to assist in clinical diagnosis. Future endeavors will concentrate on enhancing dataset variety and refining deployment for real-time clinical applications.

Author Contributions

Conceptualization, A.M.M. and K.A.S.; Data curation, S.R.; Formal analysis, S.S.; Funding acquisition, A.M.M.; Investigation, S.R. and S.S.; Methodology, A.M.M., K.A.S. and S.S.; Project administration, A.M.M.; Resources, A.M.M.; Software, S.S.; Supervision, A.M.M.; Validation, A.M.M., S.R. and S.S.; Writing—original draft, A.M.M. and S.S.; Writing—review and editing, A.M.M., K.A.S., S.R. and S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Kuwait University Research Grant No. EO04/18.

Institutional Review Board Statement

The study was approved by the Office of the Vice Dean for Academic Affairs, Research and Graduate Studies, College of Engineering and Petroleum, Kuwait University (reference number: 24/2/757, 15 October 2023).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data required to reproduce the above findings cannot be shared for ethical reasons.

Acknowledgments

The authors thank Kuwait University for their continuous support in completing this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Boyd, K. Diabetic Retinopathy: Causes, Symptoms, Treatment. American Academy of Ophthalmology. Available online: https://www.aao.org/eye-health/diseases/what-is-diabetic-retinopathy (accessed on 16 June 2025).
  2. Mutawa, A.M.; Al-Sabti, K.; Raizada, S.; Sruthi, S. A Deep Learning Model for Detecting Diabetic Retinopathy Stages with Discrete Wavelet Transform. Appl. Sci. 2024, 14, 4428.
  3. Macsik, P.; Pavlovicova, J.; Kajan, S.; Goga, J.; Kurilova, V. Image preprocessing-based ensemble deep learning classification of diabetic retinopathy. IET Image Process. 2024, 18, 807–828.
  4. Zaylaa, A.J.; Kourtian, S. From Pixels to Diagnosis: Early Detection of Diabetic Retinopathy Using Optical Images and Deep Neural Networks. Appl. Sci. 2025, 15, 2684.
  5. Renu, D.S.; Saji, K.S. Hybrid deep learning framework for diabetic retinopathy classification with optimized attention AlexNet. Comput. Biol. Med. 2025, 190, 110054.
  6. Moannaei, M.; Jadidian, F.; Doustmohammadi, T.; Kiapasha, A.M.; Bayani, R.; Rahmani, M.; Jahanbazy, M.R.; Sohrabivafa, F.; Anar, M.A.; Magsudy, A.; et al. Performance and limitation of machine learning algorithms for diabetic retinopathy screening and its application in health management: A meta-analysis. Biomed. Eng. Online 2025, 24, 34.
  7. Alsohemi, R.; Dardouri, S. Fundus Image-Based Eye Disease Detection Using EfficientNetB3 Architecture. J. Imaging 2025, 11, 279.
  8. Yu, T.; Shao, A.; Wu, H.; Su, Z.; Shen, W.; Zhou, J.; Lin, X.; Shi, D.; Grzybowski, A.; Wu, J.; et al. A Systematic Review of Advances in AI-Assisted Analysis of Fundus Fluorescein Angiography (FFA) Images: From Detection to Report Generation. Ophthalmol. Ther. 2025, 14, 599–619.
  9. Seo, H.; Park, S.-J.; Song, M. Diabetic Retinopathy (DR): Mechanisms, Current Therapies, and Emerging Strategies. Cells 2025, 14, 376.
  10. Manohar, R.; Aarthi, M.S.; Ancy Jenifer, J. Leveraging Deep Learning For Early Stage Diabetic Retinopathy Detection: A Novel CNN and Transfer Learning Comparison. In Proceedings of the 2024 2nd International Conference on Artificial Intelligence and Machine Learning Applications Theme: Healthcare and Internet of Things (AIMLA), Namakkal, India, 15–16 March 2024; pp. 1–6.
  11. Asif, S. DEO-Fusion: Differential evolution optimization for fusion of CNN models in eye disease detection. Biomed. Signal Process. Control. 2025, 107, 107853.
  12. Zubair, M.; Umair, M.; Naqvi, R.A.; Hussain, D.; Owais, M.; Werghi, N. A comprehensive computer-aided system for an early-stage diagnosis and classification of diabetic macular edema. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 101719.
  13. Hussain, M. YOLOv1 to v8: Unveiling Each Variant–A Comprehensive Review of YOLO. IEEE Access 2024, 12, 42816–42833.
  14. Mao, M.; Hong, M. YOLO Object Detection for Real-Time Fabric Defect Inspection in the Textile Industry: A Review of YOLOv1 to YOLOv11. Sensors 2025, 25, 2270.
  15. Du, J. Understanding of Object Detection Based on CNN Family and YOLO. J. Phys. Conf. Ser. 2018, 1004, 012029.
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  17. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
  18. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  19. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
  20. Jocher, G.; Stoken, A.; Chaurasia, A.; Borovec, J.; NanoCode; TaoXie; Kwon, Y.; Michael, K.; Liu, C.; Fang, J.; et al. ultralytics/yolov5: v6.0—YOLOv5n ‘Nano’ Models, Roboflow Integration, TensorFlow Export, OpenCV DNN Support; Zenodo: Geneve, Switzerland, 2021.
  21. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A full-scale reloading. arXiv 2023, arXiv:2301.05586.
  22. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
  23. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6.
  24. Wang, Y.; Rong, Q.; Hu, C. Ripe tomato detection algorithm based on improved YOLOv9. Plants 2024, 13, 3253.
  25. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
  26. Jocher, G.; Qiu, J. Ultralytics YOLO11. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 30 September 2024).
  27. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  28. Zhang, Z.; Zhao, H.; Dong, L.; Luo, L.; Wang, H. A Study on the Interpretability of Diabetic Retinopathy Diagnostic Models. Bioengineering 2025, 12, 1231. [Google Scholar] [CrossRef]
  29. Şevik, U.; Mutlu, O. Automated Multi-Class Classification of Retinal Pathologies: A Deep Learning Approach to Unified Ophthalmic Screening. Diagnostics 2025, 15, 2745. [Google Scholar] [CrossRef]
  30. Singh, A.; Jain, S.; Arora, V. A Multi-Model Image Enhancement and Tailored U-Net Architecture for Robust Diabetic Retinopathy Grading. Diagnostics 2025, 15, 2355. [Google Scholar] [CrossRef]
  31. Youldash, M.; Rahman, A.; Alsayed, M.; Sebiany, A.; Alzayat, J.; Aljishi, N.; Alshammari, G.; Alqahtani, M. Early Detection and Classification of Diabetic Retinopathy: A Deep Learning Approach. AI 2024, 5, 2586–2617. [Google Scholar] [CrossRef]
  32. Wei, X.; Liu, Y.; Zhang, F.; Geng, L.; Shan, C.; Cao, X.; Xiao, Z. MSTNet: Multi-scale spatial-aware transformer with multi-instance learning for diabetic retinopathy classification. Med. Image Anal. 2025, 102, 103511. [Google Scholar] [CrossRef] [PubMed]
  33. Hidri, M.S.; Hidri, A.; Alsaif, S.A.; Alahmari, M.; AlShehri, E. Optimal Convolutional Networks for Staging and Detecting of Diabetic Retinopathy. Information 2025, 16, 221. (In English) [Google Scholar] [CrossRef]
  34. Sharma, N.; Lalwani, P. A multi model deep net with an explainable AI based framework for diabetic retinopathy segmentation and classification. Sci. Rep. 2025, 15, 8777. [Google Scholar] [CrossRef]
  35. Herrero-Tudela, M.; Romero-Oraá, R.; Hornero, R.; Tobal, G.C.G.; López, M.I.; García, M. An explainable deep-learning model reveals clinical clues in diabetic retinopathy through SHAP. Biomed. Signal Process. Control. 2025, 102, 107328. [Google Scholar] [CrossRef]
  36. Posham, U.; Bhattacharya, S. Diabetic Retinopathy Detection Using Deep Learning Framework and Explainable Artificial Intelligence Technique. In Proceedings of the 2024 14th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 18–19 January 2024; pp. 411–415. [Google Scholar] [CrossRef]
  37. Ohri, K.; Kumar, M. Supervised fine-tuned approach for automated detection of diabetic retinopathy. Multimed. Tools Appl. 2024, 83, 14259–14280. [Google Scholar] [CrossRef]
  38. Naz, H.; Nijhawan, R.; Ahuja, N.J.; Saba, T.; Alamri, F.S.; Rehman, A. Micro-segmentation of retinal image lesions in diabetic retinopathy using energy-based fuzzy C-Means clustering (EFM-FCM). Microsc. Res. Tech. 2024, 87, 78–94. [Google Scholar] [CrossRef]
  39. Navaneethan, R.; Devarajan, H. Enhancing diabetic retinopathy detection through preprocessing and feature extraction with MGA-CSG algorithm. Expert Syst. Appl. 2024, 249, 123418. [Google Scholar] [CrossRef]
  40. Manoj S, H.; Bosale, A.A. Detection and Classification of Diabetic Retinopathy using Deep Learning Algorithms for Segmentation to Facilitate Referral Recommendation for Test and Treatment Prediction. arXiv 2024, arXiv:2401.02759. [Google Scholar] [CrossRef]
  41. Bhimavarapu, U. Diagnosis and multiclass classification of diabetic retinopathy using enhanced multi thresholding optimization algorithms and improved Naive Bayes classifier. Multimed. Tools Appl. 2024, 83, 81325–81359. [Google Scholar] [CrossRef]
  42. Nahiduzzaman, M.; Islam, M.R.; Goni, M.O.F.; Anower, M.S.; Ahsan, M.; Haider, J.; Kowalski, M. Diabetic Retinopathy Identification Using Parallel Convolutional Neural Network Based Feature Extractor and ELM Classifier. Expert Syst. Appl. 2023, 217, 119557. [Google Scholar] [CrossRef]
  43. Li, Z.; Han, Y.; Yang, X. Multi-Fundus Diseases Classification Using Retinal Optical Coherence Tomography Images with Swin Transformer V2. J. Imaging 2023, 9, 203. [Google Scholar] [CrossRef]
  44. Acosta-Jiménez, S.; Maeda-Gutiérrez, V.; Galván-Tejada, C.E.; Mendoza-Mendoza, M.M.; Reveles-Gómez, L.C.; Celaya-Padilla, J.M.; Galván-Tejada, J.I.; García-Domínguez, A. Assessing ResNeXt and RegNet Models for Diabetic Retinopathy Classification: A Comprehensive Comparative Study. Diagnostics 2025, 15, 1966. [Google Scholar] [CrossRef]
  45. Touati, M.; Touati, R.; Nana, L.; Benzarti, F.; Ben Yahia, S. DRCCT: Enhancing Diabetic Retinopathy Classification with a Compact Convolutional Transformer. Big Data Cogn. Comput. 2025, 9, 9. [Google Scholar] [CrossRef]
  46. Ema, R.R.; Shill, P.C. Multi-model approach for precise lesion localization and severity grading for diabetic retinopathy and age-related macular degeneration. Front. Comput. Sci. 2025, 7, 1497929. (In English) [Google Scholar] [CrossRef]
  47. Geetha, D.A.; Lakshmi, T.H.; Sagar, K.V.; Chaitanya, M.; Kantamaneni, S.; Battula, V.V.R.; Borra, S.P.R.; Meena, P.; Gupta, P.; Agarwal, D.S.; et al. Detection and Classification of Diabetic Retinopathy Using YOLO-V8 Deep Learning Methodology. J. Theor. Appl. Inf. Technol. 2024, 102, 7580–7588. [Google Scholar]
  48. Rizzieri, N.; Dall’asta, L.; Ozoliņš, M. Diabetic Retinopathy Features Segmentation without Coding Experience with Computer Vision Models YOLOv8 and YOLOv9. Vision 2024, 8, 48. (In English) [Google Scholar] [CrossRef] [PubMed]
  49. Zhang, B.; Li, J.; Bai, Y.; Jiang, Q.; Yan, B.; Wang, Z. An Improved Microaneurysm Detection Model Based on SwinIR and YOLOv8. Bioengineering 2023, 10, 1405. (In English) [Google Scholar] [CrossRef] [PubMed]
  50. Sait, A.R.W. A Lightweight Diabetic Retinopathy Detection Model Using a Deep-Learning Technique. Diagnostics 2023, 13, 3120. [Google Scholar] [CrossRef]
  51. L., R.; Padyana, A. Detection of Diabetic Retinopathy in Retinal Fundus Image Using YOLO-RF Model. In Proceedings of the 2021 Sixth International Conference on Image Information Processing (ICIIP), Shimla, India, 26–28 November 2021; Volume 6, pp. 105–109. [Google Scholar] [CrossRef]
  52. Santos, C.; Aguiar, M.; Welfer, D.; Belloni, B. A New Approach for Detecting Fundus Lesions Using Image Processing and Deep Neural Network Architecture Based on YOLO Model. Sensors 2022, 22, 6441. [Google Scholar] [CrossRef] [PubMed]
  53. Mahapadi, A.A.; Shirsath, V.; Pundge, A. Real-Time Diabetic Retinopathy Detection Using YOLO-v10 with Nature-Inspired Optimization. Biomed. Mater. Devices 2025, 3, 1–23. [Google Scholar] [CrossRef]
  54. Liao, Y.; Li, L.; Xiao, H.; Xu, F.; Shan, B.; Yin, H. YOLO-MECD: Citrus Detection Algorithm Based on YOLOv11. Agronomy 2025, 15, 687. [Google Scholar] [CrossRef]
  55. Dihin, R.A.; AlShemmary, E.N.; Al-Jawher, W.A.M. Wavelet-Attention Swin for Automatic Diabetic Retinopathy Classification. Baghdad Sci. J. 2024, 21, 2741–2756. [Google Scholar] [CrossRef]
  56. Abid, A.; Abdalla, A.; Abid, A.; Khan, D.; Alfozan, A.; Zou, J. Gradio: Hassle-free sharing and testing of ML models in the wild. arXiv 2019, arXiv:1906.02569. [Google Scholar] [CrossRef]
  57. Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-Reference Image Quality Assessment in the Spatial Domain. IEEE Trans. Image Process. 2012, 21, 4695–4708. [Google Scholar] [CrossRef]
  58. Ž, T.; Pintarić, N.; Saulig, N. Blind Image Quality Assessment Score for Humanities Online Digital Repositories. In Proceedings of the 2024 9th International Conference on Smart and Sustainable Technologies (SpliTech), Split, Croatia, 25–28 June 2024; pp. 1–6. [Google Scholar] [CrossRef]
  59. Setiawan, A.W.; Mengko, T.R.; Santoso, O.S.; Suksmono, A.B. Color retinal image enhancement using CLAHE. In Proceedings of the International Conference on ICT for Smart Society, Yogyakarta, Indonesia, 13–14 June 2013; pp. 1–3. [Google Scholar] [CrossRef]
  60. Ultralytics. Data Preprocessing Techniques for Annotated Computer Vision Data. Available online: https://docs.ultralytics.com/guides/preprocessing_annotated_data/ (accessed on 26 April 2025).
  61. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 13–19 June 2020; pp. 702–703. [Google Scholar]
  62. Ali, M.L.; Zhang, Z. The YOLO Framework: A Comprehensive Review of Evolution, Applications, and Benchmarks in Object Detection. Computers 2024, 13, 336. [Google Scholar] [CrossRef]
  63. Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  64. Hussain, M. YOLOv5, YOLOv8 and YOLOv10: The go-to detectors for real-time vision. arXiv 2024, arXiv:2407.02988. [Google Scholar]
  65. Liu, Y.; Liu, Y.; Guo, X.; Ling, X.; Geng, Q. Metal surface defect detection using SLF-YOLO enhanced YOLOv8 model. Sci. Rep. 2025, 15, 11105. [Google Scholar] [CrossRef] [PubMed]
  66. Glenn, J.; Ayush, C.; Jing, Q. Ultralytics YOLOv8. Available online: https://docs.ultralytics.com/models/yolov8/ (accessed on 26 September 2024).
  67. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  68. Sun, R.-Y. Optimization for Deep Learning: An Overview. J. Oper. Res. Soc. China 2020, 8, 249–294. [Google Scholar] [CrossRef]
  69. Fan, J.; Upadhye, S.; Worster, A. Understanding receiver operating characteristic (ROC) curves. Can. J. Emerg. Med. 2006, 8, 19–20. [Google Scholar] [CrossRef]
  70. Muhammad, M.B.; Yeasin, M. Eigen-CAM: Visual Explanations for Deep Convolutional Neural Networks. SN Comput. Sci. 2021, 2, 47. [Google Scholar] [CrossRef]
  71. Nandhini, E.; Vadivu, G. Convolutional Neural Network-Based Multi-Fruit Classification and Quality Grading with a Gradio Interface. In Proceedings of the 2024 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Chennai, India, 12–13 December 2024; pp. 1–7. [Google Scholar] [CrossRef]
  72. Selvaraj, J.; Sadaf, K.; Aslam, S.M.; Umapathy, S. Multiclassification of Colorectal Polyps from Colonoscopy Images Using AI for Early Diagnosis. Diagnostics 2025, 15, 1285. (In English) [Google Scholar] [CrossRef]
  73. Cuadros, J.; Bresnick, G. EyePACS: An adaptable telemedicine system for diabetic retinopathy screening. J. Diabetes Sci. Technol. 2009, 3, 509–516. (In English) [Google Scholar] [CrossRef] [PubMed]
Figure 1. The history of the YOLO model and how it evolved over time.
Figure 2. The framework for the proposed study of DR detection and grading.
Figure 3. Images from the Kuwait-based dataset. (a) Normal retina, (b) Mild NpDR, (c) Moderate NpDR, (d) Severe NpDR, and (e) PDR.
Figure 4. RGB channels of a fundus image (a random image from the Mild NpDR class).
Figure 5. The BRISQUE, NIQE, and PIQE scores of an image in the dataset.
Figure 6. Noisy and blurred versions of a random original image from the dataset.
Figure 7. The original and CLAHE-preprocessed images from the five classes.
Figure 8. Sample images during the training phase with augmentation.
Figure 9. Basic YOLO architecture.
Figure 10. (a) Architecture of the YOLOv8 model; (b) each block in the architecture.
Figure 11. (a) YOLO11 architectural additions: the C3K2 and C2PSA blocks. The backbone, neck, and head are similar to YOLOv8; the C2f block is replaced by the C3K2 block, and C2PSA is added in the neck of the model. (b) Each block in the architecture.
Figure 12. The training and validation loss-accuracy plot of the YOLO 11l model for binary classification.
Figure 13. The confusion matrix of the YOLO 11l model for binary classification.
Figure 14. The individual class AUC scores of the YOLO 11l model for binary classification.
Figure 15. The training and validation loss-accuracy plot of the YOLO 8x model for five-class classification.
Figure 16. The confusion matrix of the YOLO 8x model for five-class classification.
Figure 17. The individual class AUC scores of the YOLO 8x model for five-class classification.
Figure 18. The AUC-ROC curves: (a) the YOLO 11l model for binary classification and (b) the YOLO 8x model for five-class classification.
Figure 19. Visualization of the model's classifications with Eigen-CAM.
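The Eigen-CAM overlays in Figure 19 can be produced with the pytorch-grad-cam package. The sketch below is a minimal illustration rather than the authors' exact pipeline; the checkpoint path, input file name, and choice of target layer are assumptions for illustration only.

```python
# Minimal Eigen-CAM sketch (assumptions: the pytorch-grad-cam package, a trained
# Ultralytics classification checkpoint named "best.pt", and the second-to-last
# backbone module as the target layer; the paper's setup may differ).
import cv2
import numpy as np
import torch
from ultralytics import YOLO
from pytorch_grad_cam import EigenCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

yolo = YOLO("best.pt")                      # hypothetical path to trained weights
model = yolo.model.eval()                   # underlying PyTorch module
target_layers = [model.model[-2]]           # assumed layer; inspect model.model to choose

img = cv2.cvtColor(cv2.imread("fundus.jpg"), cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (640, 640)).astype(np.float32) / 255.0
tensor = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)

cam = EigenCAM(model=model, target_layers=target_layers)
heatmap = cam(input_tensor=tensor)[0]       # (H, W) class-agnostic activation map
overlay = show_cam_on_image(img, heatmap, use_rgb=True)
cv2.imwrite("eigencam_overlay.jpg", cv2.cvtColor(overlay, cv2.COLOR_RGB2BGR))
```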
Figure 20. Flow diagram of the Gradio interface. The model is tested with the EyePACS dataset.
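Figure 20 summarizes the Gradio demo used for real-time interaction. The following is a minimal sketch of such an interface wrapped around an Ultralytics classification checkpoint; the weight path, file names, and labels are placeholders, not the authors' deployed application.

```python
# Minimal Gradio sketch (assumptions: an Ultralytics classification checkpoint
# named "best.pt" with class names stored in the model; not the authors' exact app).
import gradio as gr
from ultralytics import YOLO

model = YOLO("best.pt")  # hypothetical path to the trained classifier

def classify(image):
    """Run the classifier on an uploaded fundus image and return class probabilities."""
    result = model.predict(image, imgsz=640, verbose=False)[0]
    probs = result.probs.data.tolist()
    return {result.names[i]: float(p) for i, p in enumerate(probs)}

demo = gr.Interface(
    fn=classify,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=5),
    title="DR Screening Demo",
)

if __name__ == "__main__":
    demo.launch()
```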
Table 1. Key features of each YOLO version.
YOLO Version | Year | Key Architectural Innovation | Core Methodological Improvement | Reference
YOLOv1 | 2015 | Darknet-based custom CNN | First end-to-end, single-stage real-time detector, greatly improving speed. | [16]
YOLOv2 | 2016 | Darknet-19 | Introduced anchor boxes and dimension clusters for better localization; used batch normalization for stability. | [17]
YOLOv3 | 2018 | Darknet-53 | Introduced multi-scale detection via a Feature Pyramid Network (FPN) mechanism for improved small-object detection. | [18]
YOLOv4 | 2020 | CSPDarknet53 | Incorporated numerous 'Bag of Freebies' (e.g., CutMix, Mosaic) for significant performance gains. | [19]
YOLOv5 | 2020 | PyTorch 1.8.0 implementation, Focus layer | Introduced the Focus layer (later replaced by Conv) and major pipeline optimizations; emphasized efficiency and accessibility. | [20]
YOLOv6 | 2022 | EfficientRep backbone | Added a bidirectional concatenation module to improve localization signals and an auxiliary regression branch during training. | [21]
YOLOv7 | 2022 | E-ELAN block, trainable bag-of-freebies | Introduced extended efficient layer aggregation networks (E-ELAN) for better parameter usage and re-parameterization techniques. | [22]
YOLOv8 | 2023 | C2f block (C3-like), decoupled head | Used a simpler C2f module and replaced the coupled classification and detection head with a decoupled head for improved convergence. | [23]
YOLOv9 | 2024 | Generalized Efficient Layer Aggregation Network (GELAN) | Introduced to prevent information loss in deep networks, enhancing data utilization and boosting performance. | [24]
YOLOv10 | 2024 | Enhanced version of CSPNet | Includes large-kernel convolutions and partial self-attention modules to improve performance. | [25]
YOLO 11 | 2024 | Enhanced version of CSPNet | Expanded capabilities across multiple computer vision tasks (object detection, instance segmentation, classification). | [26]
YOLO 12 | 2025 | Residual Efficient Layer Aggregation Networks | Attention-centric architecture. | [27]
Table 2. Background studies on DR.
Reference, Year | Dataset | Dataset Size | Methodology | Findings | Limitations
[4], 2025 | APTOS with 5 classes | 3662 images | Employed different pretrained models | ResNet-50 and GoogleNet outperform the other pretrained models with 93.56% accuracy. | The model must be externally validated to be generalized.
[5], 2025 | APTOS and EyePACS with 5 classes | APTOS: 3662; EyePACS: 35,126 | Attention block with an AlexNet model | The model reached 99.51% accuracy for APTOS and 99.43% for EyePACS. | The model must be externally validated to be generalized.
[32], 2025 | APTOS, Messidor (5 classes); RFMiD2020, IDRiD (binary) | APTOS: 3662; RFMiD2020: 1900; Messidor: 1200; IDRiD: 516 | A transformer network is utilized for the classification model. | The APTOS and RFMiD2020 datasets achieved the best accuracy, with 97% each. | Different evaluation metrics are used for different datasets.
[34], 2025 | DiaRetDB1 (binary); APTOS and EyePACS (5 classes) | DiaRetDB1: 1008; APTOS: 7170; EyePACS: 21,600 | Different feature extraction and optimization algorithms are employed with a U-Net classification model. | The model achieved 99% on all three datasets. | Despite the model's lightweight architecture, real-time applications may still encounter difficulties due to the complexity of fundus images.
[35], 2025 | APTOS-2019, EyePACS, DDR, IDRiD, and SUSTech-SYSU | APTOS: 3662; EyePACS: 35,126; SUSTech-SYSU: 1219; DDR: 12,552; IDRiD: 516 | Employed different pretrained models with SHAP explainability. | Achieved an accuracy of 89% when tested on the SUSTech-SYSU dataset. | Despite the model's lightweight architecture, real-time applications may still encounter difficulties due to the complexity of fundus images.
[46], 2025 | KLC and Shiromon1 datasets with binary classification | KLC: 8000; Shiromon1: 10,000 | Uses a YOLO model for localization and CNN, RF, and SVM for classification. | With CNN-RF, the model shows 98.81% accuracy on the KLC data and 92.11% on the Shiromon1 dataset. | The need for further inquiry into the interpretability and explainability of models in medical applications is overlooked.
[47], 2024 | Not specified, 5 classes | - | The YOLOv8 model is compared with CNN, SVM, VGG16, and ResNet50. | VGG16 shows 63.47% accuracy compared to the other models. | The study needs to evaluate more performance metrics, external validation, and the model's interpretability.
[55], 2024 | APTOS with 5 classes | 3662 images | Wavelet with a Swin Transformer model | The classification accuracy was enhanced. | The study employed just a single image set for evaluating the model.
[50], 2023 | APTOS and EyePACS with 5 classes | APTOS: 5590; EyePACS: 35,100 | Employed YOLOv7 for feature extraction and MobileNetV3 for classification. | The model achieved 98% for APTOS and 98.4% for the EyePACS dataset. | Despite the model's lightweight architecture, real-time applications may still encounter difficulties due to the complexity of fundus images.
[51], 2021 | EyePACS and IDRiD with 5 classes | EyePACS: 35,100; IDRiD: 516 | The YOLO-RF model is compared to YOLO, RF, SVM, and Decision Tree. | The model achieved 99.3% accuracy compared to the other models. | Only accuracy, precision, and recall are provided. The model must be externally validated to be generalized.
Table 3. Information about the local dataset size.
Stages | Normal | Mild | Moderate | Severe | PDR
Before cleaning | 426 | 81 | 113 | 84 | 392
After cleaning | 319 | 69 | 92 | 69 | 257
For binary classification | 319 (Normal) | 487 (DR) | | |
Table 4. The scores of the BRISQUE, NIQE, and PIQE metrics based on a random image.
Metrics | Original | Noisy | Blurry
BRISQUE | 30.487 | 48.618 | 50.072
NIQE | 3.709 | 12.438 | 5.593
PIQE | 8.099 | 74.114 | 91.703
Table 5. The training, validation, and test split for the proposed study (the split is performed at the patient level to prevent data leakage).
Classes | Training | Validation | Testing
Normal | 279 | 18 | 22
Mild | 54 | 10 | 5
Moderate | 63 | 13 | 16
Severe | 48 | 10 | 11
PDR | 195 | 34 | 28
Table 6. The details of each YOLO model employed in the proposed study with layers, parameters, and architectural differences.
Model | Layers | Params (M) | FLOPs (B) | Architectural Difference
YOLO11n | 86 | 1.53 | 3.3 | Employed the C3k2 block in the backbone and added C2PSA to improve spatial attention
YOLO11s | 86 | 5.46 | 12.1 |
YOLO11m | 106 | 10.36 | 39.6 |
YOLO11l | 176 | 12.84 | 49.8 |
YOLO11x | 176 | 28.36 | 111.0 |
YOLOv8n | 56 | 1.44 | 3.4 | Dynamic kernel attention, Path Aggregation Network
YOLOv8s | 56 | 5.08 | 12.6 |
YOLOv8m | 80 | 15.77 | 41.9 |
YOLOv8l | 104 | 36.20 | 99.1 |
YOLOv8x | 104 | 56.14 | 154.3 |
Table 7. The parameters employed in the study for all the YOLO models.
Parameter | Value
Image size | 640 × 640 × 3
Optimizer | AdamW (auto-selected according to the dataset)
Learning rate | 0.01
Batch size | 16
Cosine learning rate scheduler | True
Epochs | 100
Patience | 15
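As a minimal sketch, the Table 7 settings map directly onto the Ultralytics training API. The dataset path and checkpoint name below are assumptions for illustration; the authors' exact training scripts may differ.

```python
# Minimal training sketch with the Table 7 settings (assumptions: the Ultralytics
# classification pipeline and a folder-per-class dataset at "dr_dataset/").
from ultralytics import YOLO

model = YOLO("yolo11l-cls.pt")          # pretrained classification weights
model.train(
    data="dr_dataset",                  # hypothetical train/val/test folder structure
    imgsz=640,                          # 640 x 640 input size
    optimizer="AdamW",                  # optimizer from Table 7
    lr0=0.01,                           # initial learning rate
    batch=16,
    cos_lr=True,                        # cosine learning-rate scheduler
    epochs=100,
    patience=15,                        # early-stopping patience
)
metrics = model.val()                   # top-1 / top-5 accuracy on the validation split
```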
Table 8. The execution time for all the YOLO models in the binary classification module.
Models | Early Stopping | Final Learning Rate | Training Time (hours) | Inference Latency (ms)
YOLO 8n | 55 | 0.0016 | 0.033 | 1.1
YOLO 8s | 35 | 0.0016 | 0.026 | 2.0
YOLO 8m | 39 | 0.0016 | 0.026 | 3.8
YOLO 8l | 33 | 0.0016 | 0.021 | 4.7
YOLO 8x | 43 | 0.0016 | 0.042 | 6.6
YOLO 11n | 33 | 0.0016 | 0.028 | 8.7
YOLO 11s | 33 | 0.0016 | 0.030 | 13.2
YOLO 11m | 25 | 0.0016 | 0.024 | 7.1
YOLO 11l | 29 | 0.0016 | 0.026 | 11.6
YOLO 11x | 34 | 0.0016 | 0.036 | 16.8
Table 9. The performance metrics of all the YOLO models in the binary classification module.
Models | Validation Accuracy | Testing Accuracy | Precision | Recall | F1-Score | MCC
YOLO 8n | 0.9601 | 0.9431 | 0.9436 | 0.9428 | 0.9430 | 0.8864
YOLO 8s | 0.9504 | 0.9593 | 0.9593 | 0.9594 | 0.9593 | 0.9187
YOLO 8m | 0.9582 | 0.9621 | 0.9620 | 0.9621 | 0.9621 | 0.9242
YOLO 8l | 0.9598 | 0.9539 | 0.9543 | 0.9543 | 0.9539 | 0.9085
YOLO 8x | 0.9600 | 0.9621 | 0.9626 | 0.9625 | 0.9621 | 0.9250
YOLO 11n | 0.9640 | 0.9621 | 0.9623 | 0.9624 | 0.9621 | 0.9246
YOLO 11s | 0.9651 | 0.9566 | 0.9571 | 0.9570 | 0.9566 | 0.9142
YOLO 11m | 0.9541 | 0.9521 | 0.9512 | 0.9513 | 0.9512 | 0.9025
YOLO 11l | 0.9524 | 0.9702 | 0.9702 | 0.9701 | 0.9702 | 0.9404
YOLO 11x | 0.9630 | 0.9539 | 0.9540 | 0.9542 | 0.9539 | 0.9082
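The tabulated metrics can be computed from test-set predictions with scikit-learn, as in the brief sketch below. The label arrays are placeholders, and the averaging scheme (weighted here) is an assumption, since the exact scheme is not restated in this section.

```python
# Sketch of the reported metrics from true and predicted labels using scikit-learn
# (y_true / y_pred are placeholders for the test-set labels and model predictions).
from sklearn.metrics import (accuracy_score, matthews_corrcoef,
                             precision_recall_fscore_support)

y_true = [0, 1, 1, 0, 1]    # toy example; replace with the actual test labels
y_pred = [0, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
mcc = matthews_corrcoef(y_true, y_pred)
print(f"Acc={accuracy:.4f}  P={precision:.4f}  R={recall:.4f}  F1={f1:.4f}  MCC={mcc:.4f}")
```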
Table 10. The execution time for all the YOLO models in the five-class classification module.
Models | Early Stopping | Final Learning Rate | Training Time (hours) | Inference Latency (ms)
YOLO 8n | 23 | 0.0011 | 0.015 | 1.6
YOLO 8s | 24 | 0.0011 | 0.016 | 3.8
YOLO 8m | 24 | 0.0011 | 0.020 | 6.5
YOLO 8l | 23 | 0.0011 | 0.021 | 11.2
YOLO 8x | 18 | 0.0011 | 0.018 | 16.8
YOLO 11n | 25 | 0.0011 | 0.024 | 8.3
YOLO 11s | 26 | 0.0011 | 0.025 | 5.7
YOLO 11m | 23 | 0.0011 | 0.024 | 20.2
YOLO 11l | 28 | 0.0011 | 0.029 | 11.6
YOLO 11x | 30 | 0.0011 | 0.032 | 9.2
Table 11. The performance metrics of all the YOLO models in the five-class classification.
Models | Validation Accuracy | Testing Accuracy | Precision | Recall | F1-Score | MCC
YOLO 8n | 0.7680 | 0.7751 | 0.7521 | 0.7751 | 0.7559 | 0.6529
YOLO 8s | 0.7818 | 0.7642 | 0.7483 | 0.7642 | 0.7514 | 0.6385
YOLO 8m | 0.7956 | 0.7642 | 0.7608 | 0.7642 | 0.7576 | 0.6468
YOLO 8l | 0.7680 | 0.7561 | 0.7368 | 0.7561 | 0.7427 | 0.6244
YOLO 8x | 0.8228 | 0.8012 | 0.8514 | 0.8012 | 0.8143 | 0.7288
YOLO 11n | 0.7983 | 0.7642 | 0.7503 | 0.7642 | 0.7529 | 0.6390
YOLO 11s | 0.8066 | 0.7751 | 0.7671 | 0.7751 | 0.7700 | 0.6583
YOLO 11m | 0.7707 | 0.7805 | 0.7585 | 0.7804 | 0.7646 | 0.6643
YOLO 11l | 0.7845 | 0.7696 | 0.7585 | 0.7696 | 0.7467 | 0.6517
YOLO 11x | 0.8039 | 0.7886 | 0.7774 | 0.7886 | 0.7816 | 0.6777
Table 12. Classification results with pretrained models.
Classification | Models | Testing Accuracy | Precision | Recall | F1-Score | AUC
Binary | ConvNeXtSmall | 0.61 | 0.37 | 0.61 | 0.46 | 0.50
Binary | ConvNeXtBase | 0.62 | 0.76 | 0.61 | 0.47 | 0.51
Binary | ConvNeXtTiny | 0.60 | 0.30 | 0.50 | 0.38 | 0.50
Binary | EfficientNetV2S | 0.60 | 0.30 | 0.50 | 0.38 | 0.50
Binary | EfficientNetB0 | 0.61 | 0.37 | 0.61 | 0.46 | 0.50
Binary | ViT | 0.61 | 0.37 | 0.61 | 0.46 | 0.50
5-class | ConvNeXtSmall | 0.65 | 0.54 | 0.65 | 0.58 | 0.89
5-class | ConvNeXtBase | 0.68 | 0.59 | 0.68 | 0.63 | 0.89
5-class | ConvNeXtTiny | 0.66 | 0.55 | 0.66 | 0.59 | 0.88
5-class | EfficientNetV2S | 0.69 | 0.60 | 0.69 | 0.63 | 0.90
5-class | EfficientNetB0 | 0.66 | 0.55 | 0.66 | 0.59 | 0.85
5-class | ViT | 0.39 | 0.15 | 0.39 | 0.22 | 0.47
Table 13. Comparative assessment of the YOLO model variants with ranks per metric.
Classification | Models | Val Accuracy | Test Accuracy | Precision | Recall | F1-Score | MCC | Avg Rank
Binary (Chi-square = 35.77, p-value = 4.33 × 10^−5) | YOLO 8n | 4 | 10 | 10 | 10 | 10 | 10 | 9
 | YOLO 8s | 10 | 5 | 5 | 5 | 5 | 5 | 5.83
 | YOLO 8m | 7 | 3 | 4 | 4 | 3 | 4 | 4.16
 | YOLO 8l | 6 | 7.5 | 7 | 7 | 7.5 | 7 | 7
 | YOLO 8x | 5 | 3 | 2 | 2 | 3 | 2 | 2.83
 | YOLO 11n | 2 | 3 | 3 | 3 | 3 | 3 | 2.83
 | YOLO 11s | 1 | 6 | 6 | 6 | 6 | 6 | 5.16
 | YOLO 11m | 8 | 9 | 9 | 9 | 9 | 9 | 8.83
 | YOLO 11l | 9 | 1 | 1 | 1 | 1 | 1 | 2.33
 | YOLO 11x | 3 | 7.5 | 8 | 8 | 7.5 | 8 | 7
5-class (Chi-square = 44.76, p-value = 1.02 × 10^−6) | YOLO 8n | 9.5 | 4.5 | 7 | 4.5 | 6 | 5 | 6.08
 | YOLO 8s | 7 | 8 | 9 | 8 | 8 | 9 | 8.16
 | YOLO 8m | 5 | 8 | 4 | 8 | 5 | 7 | 6.16
 | YOLO 8l | 9.5 | 10 | 10 | 10 | 10 | 10 | 9.91
 | YOLO 8x | 1 | 1 | 1 | 1 | 1 | 1 | 1
 | YOLO 11n | 4 | 8 | 8 | 8 | 7 | 8 | 7.16
 | YOLO 11s | 2 | 4.5 | 3 | 4.5 | 3 | 4 | 3.5
 | YOLO 11m | 8 | 3 | 5.5 | 3 | 4 | 3 | 4.41
 | YOLO 11l | 6 | 6 | 5.5 | 6 | 9 | 6 | 6.41
 | YOLO 11x | 3 | 2 | 2 | 2 | 2 | 2 | 2.16
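The Friedman statistics in Table 13 can be checked with SciPy by treating the six evaluation metrics as repeated measures (blocks) and the ten models as groups; this framing is an assumption consistent with the per-metric ranks above. The sketch below uses the binary-classification scores from Table 9 and should approximately reproduce the reported chi-square of 35.77.

```python
# Friedman test across the ten YOLO variants: each row lists one model's scores
# on the six metrics of Table 9 (Val Acc, Test Acc, Precision, Recall, F1, MCC).
from scipy.stats import friedmanchisquare

scores = {
    "YOLO 8n":  [0.9601, 0.9431, 0.9436, 0.9428, 0.9430, 0.8864],
    "YOLO 8s":  [0.9504, 0.9593, 0.9593, 0.9594, 0.9593, 0.9187],
    "YOLO 8m":  [0.9582, 0.9621, 0.9620, 0.9621, 0.9621, 0.9242],
    "YOLO 8l":  [0.9598, 0.9539, 0.9543, 0.9543, 0.9539, 0.9085],
    "YOLO 8x":  [0.9600, 0.9621, 0.9626, 0.9625, 0.9621, 0.9250],
    "YOLO 11n": [0.9640, 0.9621, 0.9623, 0.9624, 0.9621, 0.9246],
    "YOLO 11s": [0.9651, 0.9566, 0.9571, 0.9570, 0.9566, 0.9142],
    "YOLO 11m": [0.9541, 0.9521, 0.9512, 0.9513, 0.9512, 0.9025],
    "YOLO 11l": [0.9524, 0.9702, 0.9702, 0.9701, 0.9702, 0.9404],
    "YOLO 11x": [0.9630, 0.9539, 0.9540, 0.9542, 0.9539, 0.9082],
}

stat, p_value = friedmanchisquare(*scores.values())
print(f"Chi-square = {stat:.2f}, p-value = {p_value:.2e}")
```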
Table 14. Comparison with previous studies utilizing YOLO models for the DR task.
Reference, Year | Dataset | Methodology | Findings
[46], 2025 | KLC and Shiromon1 datasets with binary classification | YOLO model for localization; CNN, RF, and SVM for classification | With CNN-RF, the model shows 98.81% accuracy for the KLC data and 92.11% with the Shiromon1 dataset.
[47], 2024 | Dataset not specified, 5 classes | The YOLOv8 model is compared with CNN, SVM, VGG16, and ResNet50. | VGG16 shows 63.47% accuracy compared to the other models.
[50], 2023 | APTOS and EyePACS with 5 classes | YOLOv7 for feature extraction; MobileNetV3 for classification | Accuracy: 98% for APTOS and 98.4% for the EyePACS dataset.
Proposed work | Own dataset | YOLO versions 8 and 11 for classification | Accuracy: 97.02% for binary and 80.12% for multiclass.
Table 15. The result of the ablation study.
Classification | CLAHE | Image Size | Accuracy | Precision | F1 Score
Binary model | No | 640 × 640 | 0.9398 | 0.9789 | 0.9489
Binary model | No | 240 × 240 | 0.9036 | 0.9775 | 0.9159
5-class model | No | 640 × 640 | 0.7711 | 0.7331 | 0.7461
5-class model | No | 240 × 240 | 0.7651 | 0.7541 | 0.7424
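For reference, a minimal CLAHE preprocessing sketch with OpenCV is given below. Applying CLAHE to the lightness channel in LAB space with clipLimit = 2.0 and an 8 × 8 tile grid is an assumption for illustration; the exact settings used in the study are not restated in this section.

```python
# Minimal CLAHE preprocessing sketch (assumptions: OpenCV, CLAHE on the L channel
# in LAB space, clipLimit=2.0, 8x8 tile grid; the paper's settings may differ).
import cv2

def apply_clahe(path: str, out_path: str, clip_limit: float = 2.0, grid: int = 8) -> None:
    """Enhance the local contrast of a colour fundus image with CLAHE."""
    bgr = cv2.imread(path)
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(grid, grid))
    l_eq = clahe.apply(l)                      # equalize only the lightness channel
    enhanced = cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
    cv2.imwrite(out_path, enhanced)

apply_clahe("fundus.jpg", "fundus_clahe.jpg")  # hypothetical file names
```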