Severity Classification of Diabetic Retinopathy Using an Ensemble Learning Algorithm through Analyzing Retinal Images

: Diabetic Retinopathy (DR) refers to the damages endured by the retina as an effect of diabetes. DR has become a severe health concern worldwide, as the number of diabetes patients is soaring uncountably. Periodic eye examination allows doctors to detect DR in patients at an early stage to initiate proper treatments. Advancements in artiﬁcial intelligence and camera technology have allowed us to automate the diagnosis of DR, which can beneﬁt millions of patients indeed. This paper inscribes a novel method for DR diagnosis based on the gray-level intensity and texture features extracted from fundus images using a decision tree-based ensemble learning technique. This study primarily works with the Asia Paciﬁc Tele-Ophthalmology Society 2019 Blindness Detection (APTOS 2019 BD) dataset. We undertook several steps to curate its contents to make them more suitable for machine learning applications. Our approach incorporates several image processing techniques, two feature extraction techniques, and one feature selection technique, which results in a classiﬁcation accuracy of 94.20% (margin of error: ± 0.32%) and an F-measure of 93.51% (margin of error: ± 0.5%). Several other parameters regarding the proposed method’s performance have been presented to manifest its robustness and reliability. Details on each employed technique have been included to make the provided results reproducible. This method can be a valuable tool for mass retinal screening to detect DR, thus drastically reducing the rate of vision loss attributed to it. of the model’s performance while both sets of features are provided. The average accuracy of the ten classiﬁcations while using only the histogram and GLCM features are 91.13% and 82.52%, respectively. The score reaches an average of 94.20% (95% conﬁdence interval: 93.88–94.51%) while the features were combined, which is over 3% higher than when only the histogram features were used.


Introduction
Diabetes mellitus (diabetes) collectively refers to a group of metabolic disorders that occurs if a person has an elevated sugar level in his/her blood and does not produce sufficient insulin to regulate it.Diabetes has become a severe health problem all over the world in the last few decades.According to the International Diabetes Federation (IDF) reports, globally, 463 million people between 20 and 79 had diabetes in 2019 [1].Based on their predictions, the number will reach an overwhelming 700 million by 2045.The World Health Organization (WHO) reported a 5% increase in premature mortality from diabetes since 2000 [2].Among the four primary types of diabetes, type 1 and type 2 diabetes are chronic, whereas genetic deficits and gestational diabetes are reversible in most cases [3].Diabetes causes numerous health complications, including vision impairment due to Diabetic Retinopathy (DR), heart attacks, kidney failure, and stroke.DR results from the vandalization of the blood vessels within the retina's tissue caused by prolonged diabetes.Consequently, blood, fluids, and lipids enter inside the macula and block the person's vision.DR patients often experience impaired color vision, blurred vision, loss of central vision, floaters within the vision space, and poor eyesight at night.At the advanced stages of DR, patients experience total blindness, and in most cases, the condition becomes permanent.
In the US, the number of cases of DR surged 89%, from 4.06 million to 7.69 million, within ten years (2000-2010) [4].According to their projections, the number of DR cases will reach 14 million by 2050.DR is the leading cause of vision impairment among people younger than 74 years [5].The number of diabetes cases is increasing worldwide, and it is becoming increasingly hard to diagnose DR among all the patients.The primary reason is that the disease shows almost no symptoms at its earlier stages [5].The only way to detect DR at these stages is to get regular eye checkups.However, with the growing number of diabetes cases, ensuring that all patients get regular checkups is very difficult, even for a developed country such as the US; according to a 2017 report, 21.4% of diabetes cases in the US went undiagnosed [6].Globally, the rate of undiagnosed diabetes cases is an alarming 50% (232 million people) [1].The traditional (manual) DR diagnostic systems are already inadequate to provide diagnoses to all diabetes patients.With the number of type 2 diabetes patients increasing in most countries, these manual diagnostic systems' capacity is merely insufficient to meet the requirement for constant checkups of all diabetes patients.That is the reason we need to look for alternative ways for DR diagnosis.With proper treatment initiated at the right time, more than 90% of the DR patients can recover their eyesight or prevent the disease from being vision-threatening [7].
Recent advancements in machine learning and artificial intelligence have imparted a far-reaching impact on numerous science and engineering fields.Clinical diagnosis is one of those fields that have benefited the most because of these developments.New and improved machine learning algorithms allowed researchers to automate the diagnosis of many diseases.Researchers have been trying to do the same for DR diagnosis as well.So far, several powerful methods have been proposed and practiced in this regard.Practical DR diagnosis is made based on retinal images collected using a procedure called Fundus Photography.These images capture the retina and its contents with vivid details.Convolutional Neural Network (CNN) based state-of-the-art methods work very well on clean DR images.The existing DR identification approaches differ in many aspects, including the source of the retinal images, methods of image pre-processing, type of the features extracted from retinal images, and the utilized machine learning algorithm.Most researchers use deep learning methods to work with retinal image datasets.Notably, CNNbased deep learning methods are very popular in this regard, and they are often handy.In the last 19 years, more than 425 articles have been published in renowned journals outlining various methods for DR detection [8].We present some of the notable works of recent times using deep learning algorithms in the following few paragraphs.
In 2020, D. Hemanth et al. proposed a Deep Convolutional Neural Network (DCNN)based DR detection and classification method [9].They used contrast-limited adaptive histogram equalization (CLAHE) and histogram equalization imaging techniques and acquired a classification accuracy of 97% and an F-measure of 94%.K. Shankar et al. reported a synergic deep learning model for DR classification incorporating both the image processing and segmentation techniques [10].They tested their method on the Messidor dataset and obtained over 99% classification accuracy.Gayathri et al. used a CNN model with six convolutional layers and two fully connected layers to extract features from fundus photographs [11].Then, the authors used several classifiers, including Support Vector Machine (SVM), Multilayer Perceptron, Random Forest (RF), and J48 to classify the samples of several datasets.Their proposed method acquired near-perfect classification outcomes in many cases.Liu et al. proposed three hybrid model structures based on CNN and an improved loss function (enhance cross-entropy) for DR classification in their 2020 article [12].They used five different popular deep learning architectures: EfficientNetB4, InceptionResNetV2, EfficientNetB5, NASNetLarge, and Xception as the base of their proposed method and trained three hybrid models on the outputs of these methods.The resultant method registered 86.34% classification accuracy and 98.77% sensitivity on the EyePACS dataset.Shankar et al. introduced a Hyperparameter Tuning Inception-v4 (HPTI-v4) model to classify DR images [13].They used CLAHE for image preprocessing and a histogram-based segmentation model for segmenting color fundus images.They acquired a peak accuracy of 96.25% on the Messidor dataset.Li proposed a Crossdisease Attention Network (CANet) for grading DR and Diabetic Macular Edema [14].In their study, the authors designed two attention modules: a disease-specific module that utilizes both the inter-spatial and inter-channel relationship among the features and a disease-dependent module that exploits the inter-channel features collected from the retinal images.They tested their method on two public datasets (ISBI 2018 IDRiD challenge and Messidor) and obtained 85.10% joined classification accuracy.
In 2019, Li et al. described a four-class (three non-prolific DR and one prolific DR) classification algorithm based on a customized DCNN architecture [15].In their study, they replaced the conventional max-pooling layers with fractional max-pooling layers.Their study involved 34,124 retinal images collected from a Kaggle competition dataset.However, their model scored an 86.17% classification accuracy making it somewhat convenient for practical use.Bellemo et al. proposed a CNN-based DR classification method trained using the Singapore National Diabetes Retinopathy Screening Program (SIDRP) dataset containing 76,370 retinal images [15] and validated their methods using another 4504 images collected from urban centers in the Copperbelt province of Zambia.Although they did not mention the accuracy, they provided the sensitivity and specificity values of their model, which are 92.25% and 89.04%, respectively.Sayres et al. used the popular EyePACS dataset, which contains 140,000 retinal images, to build a deep learning model [16].The primary purpose of their study was to understand the impact of DR-classification algorithms on physician readers in computer-assisted settings.They recorded an 88.40% classification accuracy with both the sensitivity and specificity scores over the 90th percentile.Zeng et al. proposed a Binocular Siamese-like CNN-based method employing 28,104 selected images from the EyePACS dataset [17].However, with 82.2% sensitivity and just 70.70% specificity, it can be said that their model did not register a reliable performance.Zhang et al. narrated an automatic DR detection and grading system following the Grading Scale of Diabetic Retinopathy (GSDR) [18].They used 3062 DR images in the training phase and performed external validation afterward.Their model has high sensitivity and specificity scores, both over 97%.de la Torre et al. proposed a CNN-based pixel-wise score-propagation model retaining 76,650 images from the EyePACS dataset [19].The specialty of their model lies in the architecture of the CNN; for every neuron, it divides the observed output score into two separate components.However, questions can be asked about the method's performance, as it renders sensitivity and specificity scores around 91% and does not go through any external validation.Pires et al. developed a multi-class DR-classification method sequentially employing multiple CNNs using 88,702 retinal images collected from two Kaggle competition datasets [20].They kept the trade-off between efficiency and effectiveness in mind while building the model, which makes it particularly suitable for implementing in smartphones.
Although non-deep learning-based DR detection methods are not so standard, they are not rare.In 2013, Akram et al. proposed a classification technique using the Gaussian Mixture Model and SVM [21].In 2014, Antal and Hajdu described a two-class (binary) DR-detection method incorporating several Decision Tree (DT)-based classifiers [22].In 2014, Wang et al. outlined an approach combining a CNN model with RF for blood vessel segmentation inside the retina [23].In 2016, Mane and Jadhav proposed a holoentropyenabled DT for DR identification [24].They tested their method on DIARETDB0 and DIARETDB1 fundus image databases and acquired 96.45% classification accuracy.In 2017, Saleh et al. proposed a DR assessment method using a fuzzy RF along with a dominancebased rough set balanced rule [25].In 2019, Jebaseeli et al. described a deep learning-based SVM (DLBSVM) for DR categorization [26].They procured 201 fundus images from five different datasets and obtained near-perfect classification performance on them.
Retinal images may contain a magnitude of noise and interference if collected using non-standard equipment and in non-ideal environments, which is often the case.This study deals with such a set of noisy retinal images collected from a reasonably new DR dataset.We employed several steps to suppress the effects of noise present in the samples.We calculated the histogram and Gray Level Co-occurrence Matrix (GLCM) of these images and used them as statistical features extracted from the samples.Then, we identified the most important features using a popular feature selection method named Genetic Algorithm (GA).We used those features to train a robust ensemble learning algorithm called Extreme Gradient Boosting (XGBoost).The proposed method resolves a multi-class classification problem, which corresponds to DR's severity in the associated retinal images.The XGBoost algorithm was hyper-tuned to obtain the best performance on this particular application.The acquired results proclaim that the proposed approach can successfully categorize the dataset samples.We provide necessary tables, figures, and graphs while describing its steps and performance for proper interpretation.
The rest of the paper is organized as follows.Section 2 discusses the employed dataset and its contents, cites the prior works involving it, outlines the study's methodology, and describes the image processing and classification techniques used in this study.Section 3 presents the obtained classification results and justifies their reliability.Finally, Section 4 provides concluding remarks on the paper, an overview of the entire article, and mentions some scopes for further research.

Materials and Methods
This study's workflow can be partitioned into four main sets of operations: retinal image collection, image pre-processing, feature extraction and selection, and DR model creation and optimization.We employed multiple image processing techniques to preprocess the retinal images, extract features from them, and a DT-based ensemble method known as XGBoost for image classification.The entire workflow has been presented in Figure 1 and narrated elaborately in the following subsections.

Employed Dataset and Prior Works on It
This study used the fundus images contained by the Asia Pacific Tele-Ophthalmology Society 2019 Blindness Detection (APTOS 2019 BD) dataset [27].This image dataset contains 3662 samples collected from many participants of rural India.Aravind Eye Hospital, India, organized the dataset.The hospital technicians traveled to India's remote areas to collect the retinal image samples using fundus photography.These fundus photographs were collected in varying conditions and environments over a long period.Later, a group of trained doctors reviewed and labeled the gathered samples following the principle of the International Clinical Diabetic Retinopathy Disease Severity Scale (ICDRSS).As per the scaling system, the APTOS 2019 BD samples are divided into five categories: no DR, mild DR, moderate DR, severe DR, and proliferative DR.The first category encompasses the healthy retinal samples where no DR is present.Each of the latter categories represents slightly more damaged retinas than the previous one.The last class, proliferative DR, is comprised of the samples that contain vitreous or pre-retinal hemorrhage.Sample retinal images of all the classes have been presented in Appendix A. Although not the first of its kind, APTOS 2019 BD is an adequate dataset to work with, and it has received quite the reception among researchers and data science experts.More than 2900 teams participated in the associated Kaggle competition, and since its emergence last year, around 20 research articles have been published (so far).We present a brief discussion on the existing body of works involving this dataset in the following few paragraphs.
In 2019, K. Singh and D. Drzewicki described a classification method using a pre-trained VGG-19 network [28].However, in their study, they used the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm instead of the famous Adam optimizer and claimed that it led to cleaner results.S. Kassani et al. reported a method where they used four deep learning architectures-InceptionV3, Mo-bileNet, ResNet50, and Xception to extract features from the dataset's images [29].Dekhil et al. adopted a transfer learning approach and classified the dataset's samples using the ImageNet architecture, which was previously trained on 3.2 million images [30].Sikder et al. presented a method incorporating the ExtraTree classifier, which is a popular ensemble learning algorithm [31].Their paper used several image processing techniques to pre-process the retinal images and extracted histogram features from them.
In 2020, Wang and Schaefer of Stanford University, California, US, described a classification method using a pre-trained MobileNetV2 architecture [32].They used data augmentation and a weighted loss function to classify the APTOS 2019 BD dataset's images.Sheikh presented a thesis using three deep learning models-DenseNet121, VGG16, and MobileNetV2 for DR classification [33].The author achieved the best classification outcome using MobileNetV2 architecture.Sheikh   Although not the first of its kind, APTOS 2019 BD is an adequate dataset to work with, and it has received quite the reception among researchers and data science experts.More than 2900 teams participated in the associated Kaggle competition, and since its emergence last year, around 20 research articles have been published (so far).We present a brief discussion on the existing body of works involving this dataset in the following few paragraphs.
In 2019, K. Singh and D. Drzewicki described a classification method using a pretrained VGG-19 network [28].However, in their study, they used the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm instead of the famous Adam optimizer and claimed that it led to cleaner results.S. Kassani et al. reported a method where they used four deep learning architectures-InceptionV3, MobileNet, ResNet50, and Xception to extract features from the dataset's images [29].Dekhil et al. adopted a transfer learning approach and classified the dataset's samples using the ImageNet architecture, which was previously trained on 3.2 million images [30].Sikder et al. presented a method incorporating the ExtraTree classifier, which is a popular ensemble learning algorithm [31].Their paper used several image processing techniques to pre-process the retinal images and extracted histogram features from them.
In 2020, Wang and Schaefer of Stanford University, California, US, described a classification method using a pre-trained MobileNetV2 architecture [32].They used data augmentation and a weighted loss function to classify the APTOS 2019 BD dataset's images.Sheikh presented a thesis using three deep learning models-DenseNet121, VGG16, and MobileNetV2 for DR classification [33].The author achieved the best classification outcome using MobileNetV2 architecture.Sheikh and Qidwai outlined a smartphone-based DR severity detection method [34].Their study used four CNN-based methods-DenseNet121, VGG16, RestNet50, and InceptionV3-for DR classification.The authors experimented with and without employing image pre-processing techniques.As expected, they achieved their peak performance when the images were pre-processed.Pak et al. performed multiple classifications on the dataset samples with various deep learning architectures such as DenseNet121, ResNet50, ResNet101, and EfficientNet-b4.They achieved the highest accuracy using EfficientNet-b4 [35].Gangwar and Ravi presented a Transfer Learning-based deep learning algorithm where they classified the images of this dataset with a pre-trained Inception-ResNet-v2 [36].They added a custom block of CNN to make it a hybrid model and tested this model on the Messidor-1 dataset as well.
Bodapati et al. proposed a multi-modal fusion module using pre-trained ConvNet models for DR classification [37].They performed two separate tasks-DR detection (binary classification) and determining its severity (multi-class classification).They also reported that a dropout at the input layer of a Deep Neural Network converges quicker while trained with blended features instead of uni-modal deep features.Kueterman described a classifier with a two-stage CNN architecture and SVM feature extraction for DR classification in a thesis work [38].Patel and Chaware presented a Transfer Learning-based Fine-tuned MobileNetV2 architecture [39].They experimented with customized and fine-tuned models with a varying number of epochs in their research.Liu et al. proposed a Graph REsidual rEranking Network (GREEN) in their reported work for DR classification based on the APTOS 2019 BD dataset [40].The EfficientNet-b0 architecture is the backbone of their described method.Dondeti  and lens condensation.A few retinal images excluded from this study have been sented in Appendix A. Such sample exclusion was also performed by authors of studies [43,44] They reported that this initial step is advantageous for efficient train and inference on the image dataset.If multiple classifications were performed in the study, the best results were taken."-" indicates that the score was not reported in the corresponding article/study.

Dealing with Duplicate Images
The dataset contains quite a few duplicate samples, most of which have b marked with different class labels.This kind of noise in labeling makes the learn model prone to misclassification.We calculated the similarities between every two ages based on their two-dimensional (2D) correlation coefficients to find such oc rences.The basic idea behind calculating the 2D correlation coefficient is to determ how much two matrices are similar to each other in a quantitative manner [45].If have two sample grayscale images, and , their 2D correlation coefficient can be termined using: where and are the mean of all values of the matrices and , respectively [ Equation (1) always results in a value within 0 and 1; 0 if and are nothing al and 1 if they are the same.Since we are working with color (RGB) images, we had calculate the coefficients separately for three different channels.If the outcomes of all three channels are 1, then we can conclude that the second image is a copy of the first o -Support Vector Machine ( Symmetry 2021, 13, x FOR PEER REVIEW and lens condensation.A few retinal images excluded sented in Appendix A. Such sample exclusion was als studies [43,44] They reported that this initial step is adv and inference on the image dataset.If multiple classifications were performed in the study, the best results were taken."-" in reported in the corresponding article/study.

Dealing with Duplicate Images
The dataset contains quite a few duplicate samp marked with different class labels.This kind of noise model prone to misclassification.We calculated the sim ages based on their two-dimensional (2D) correlation rences.The basic idea behind calculating the 2D correla how much two matrices are similar to each other in a have two sample grayscale images, and , their 2D c termined using: where and are the mean of all values of the matri Equation (1) always results in a value within 0 and 1; 0 and 1 if they are the same.Since we are working with calculate the coefficients separately for three different cha three channels are 1, then we can conclude that the second -SVM) for DR classification [41].They extracted deep features using the Neural Architecture Search Network (NASNet) architecture.Zhuang and Ettehadi used a modified Efficientnet-B3 architecture for DR classification [42].They also experimented with a shallow learning method in their study.Still, they acquired better results using the previous architecture.
Table 1 compares the reported results acquired by the classification methods applied to the APTOS 2019 BD dataset by the previously cited studies.As the table shows, CNN-based deep learning architectures did not show excellent results.Our research experimented with numerous deep learning models to reach a higher and more reliable classification outcome.However, this approach did not provide results significantly better than the existing methods.Therefore, we decided to work with manually extracted features and tree-based algorithms.After working with many sets of features and a few variants of bagging and boosting techniques, we reached the classification model described in the following section.If multiple classifications were performed in the study, the best results were taken."-" indicates that the score was not reported in the corresponding article/study.

Dealing with Duplicate Images
The dataset contains quite a few duplicate samples, most of which have been marked with different class labels.This kind of noise in labeling makes the learning model prone to misclassification.We calculated the similarities between every two images based on their two-dimensional (2D) correlation coefficients to find such occurrences.The basic idea behind calculating the 2D correlation coefficient is to determine how much two matrices are similar to each other in a quantitative manner [45].If we have two sample grayscale images, and , their 2D correlation coefficient can be determined using: where and are the mean of all values of the matrices and , respectively [46].Equation (1) always results in a value within 0 and 1; 0 if and are nothing alike, and 1 if they are the same.Since we are working with color (RGB) images, we had to calculate the coefficients separately for three different channels.If the outcomes of all the three channels are 1, then we can conclude that the second image is a copy of the first one.If multiple classifications were performed in the study, the best results were taken."-" indicates that the score was not reported in the corresponding article/study.

Pre-Processing of the Employed Dataset
Retinal images contained by the APTOS 2019 BD dataset are very convenient.The dataset has many noisy images, duplicate images labeled as different classes, images of various resolutions, and varying sample sizes.Before moving on to the feature extraction and learning stages, these issues need to be resolved or minimized to build a successful DR classification model.Below, we describe how we dealt with these challenges in this study.

Dealing with Excessively Noisy Images
The dataset contains several images with a high degree of noise and interference.All the images of the dataset were checked manually to single out the most affected ones.Those images were excluded from this study.We excluded samples in this study through visual inspection.A total of 566 images were discarded due to containing artifacts or being too underexposed, overexposed, or out of focus.Other reasons for exclusion include chromatic aberrations, digitization error, image saturation, blur, dust, darkness, and lens condensation.A few retinal images excluded from this study have been presented in Appendix A. Such sample exclusion was also performed by authors of the studies [43,44] They reported that this initial step is advantageous for efficient training and inference on the image dataset.

Dealing with Duplicate Images
The dataset contains quite a few duplicate samples, most of which have been marked with different class labels.This kind of noise in labeling makes the learning model prone to misclassification.We calculated the similarities between every two images based on their two-dimensional (2D) correlation coefficients to find such occurrences.The basic idea behind calculating the 2D correlation coefficient is to determine how much two matrices are similar to each other in a quantitative manner [45].If we have two sample grayscale images, I r A and I r B , their 2D correlation coefficient can be determined using: where I r A and I r B are the mean of all values of the matrices I r A and I r B , respectively [46].Equation (1) always results in a value within 0 and 1; 0 if I r A and I r B are nothing alike, and 1 if they are the same.Since we are working with color (RGB) images, we had to calculate the coefficients separately for three different channels.If the outcomes of all the three channels are 1, then we can conclude that the second image is a copy of the first one.Suppose we want to calculate the coefficients between every two images of the 3096 images (remaining after noisy image exclusion).In that case, the operation has to be repeated 3096 C 2 times, which would take a long time to perform and may seem unreasonable.Therefore, we first checked if the resolution (height and width in pixels) of two images match to reduce the number of operations.The 2D correlation coefficients of the two images were calculated only if the images had the same resolution.The whole process of duplicate image checking is summarized in Algorithm 1. Correlations of all three channels were checked for extra precision.We primarily focused on finding the exact images/files marked as different classes (possibly by different graders) with this algorithm.However, duplicates may exist in images of different sizes as well.Since the total number of samples was not very high, removing too many samples could have made the model less reliable.Using Algorithm 1, we found duplicates of 115 retinal images within the remaining 3096 images (after the previous step).In most cases, the copies of a sample have different class labels.We randomly picked one of the duplicate images and discarded the other(s).The rest of the study was conducted with the remaining 2981 retinal images.A decision on whether or not I B is a duplicate of Step-1 : Determine the resolution of I A and I B : Determine corr(I A , I B ) using Equation ( 1) for three channels.else return 0 Step-2 : if corr(I A , I B ) ==1 for all three channels return 1 else return 0

Dealing with Unwanted Black Borders
The majority of the images of the APTOS 2019 BD dataset contain black pixels close to the borders.As these pixels do not carry any information on DR's existence in the associated retinal image, a simple image crop operation can remove them.We summarize the process in Algorithm 2. This operation is relatively straightforward for the images that contain the entire eyeball inside its frame, such as Figure 2a.However, most of the images (such as Figure 2c) have part of the retina missing (from top and bottom), which is also another drawback of the dataset.In these cases, black pixels were only removed from the left and right sides to prevent further loss of useful information (from inside the eyeball).This step is helpful, since we will extract histogram features from the images later on.Figure 2b,d show the resultant images of the crop operation applied on the images of Figure 2a,c, respectively.Step-2 : Determine the center pixel of the image: Step-3 : Identify the first non-black pixel directly above the center pixel: for = 1 to ℎ if ( , ) > 10 in all three channels ← ; break directly left to the center pixel: for = 1 to if ( , ) > 10 in all three channels ← ; break directly below the center pixel: for = ℎ to 1 if ( , ) > 10 in all three channels ← ; break directly right to the center pixel: for = to 1 if ( , ) > 10 in all three channels ← ; break Step-4 : Crop the image: for = 1 to 3 (: , : ) ← ( : , : , ) Step-5 : return

Dealing with Uneven Image Resolution
As mentioned before, the APTOS 2019 BD dataset images have varying resolutions, ranging from 474 × 358 to 3388 × 2588 pixels in width and height, respectively.This dissimilarity can be an issue for our further processing.Therefore, we performed an image resize operation to rescale all the images to 256 × 256 pixels in this step.The images that do not contain the entire eyeball (such as Figure 3a) will be compressed along its width, as shown in Figure 3b.This step might seem unideal, but we allowed this compression since we aimed to maintain a uniform square shape and did not want to lose further information from the images.Step-2 : Determine the center pixel of the image: : Identify the first non-black pixel directly above the center pixel: directly below the center pixel: for i = 1 to h for i = h to 1 if (i, C w ) > 10 in all three channels if (i, C w ) > 10 in all three channels C 1 ← i ; break C 2 ← i ; break directly left to the center pixel: directly right to the center pixel: Step-4 : Crop the image: : return I A

Dealing with Uneven Image Resolution
As mentioned before, the APTOS 2019 BD dataset images have varying resolutions, ranging from 474 × 358 to 3388 × 2588 pixels in width and height, respectively.This dissimilarity can be an issue for our further processing.Therefore, we performed an image resize operation to rescale all the images to 256 × 256 pixels in this step.The images that do not contain the entire eyeball (such as Figure 3a) will be compressed along its width, as shown in Figure 3b.This step might seem unideal, but we allowed this compression since we aimed to maintain a uniform square shape and did not want to lose further information from the images.2, this operation does not solve the issue entirely but reduces it to some extent.After augmentation, we have 5285  2, this operation does not solve the issue entirely but reduces it to some extent.After augmentation, we have 5285 retinal images, where 26.60% of the samples belong to the majority class and 14% belong to the minority class.As mentioned earlier, the dataset images contain noise and interference.Many of the images are partially blurry or hazy, making it more challenging to mine distinguishable features from them.One way to deal with this issue is to increase the sharpness of the images, emphasizing their texture by making the borders among different regions more apparent.This study used unsharp masking, a well-known image sharpening method, to enhance the retinal images' sharpness.The main idea behind unsharp masking is to subtract a blurred version of the retinal image from the original one to detect its edges.Then, a mask is constructed with the acquired edge details.Finally, the contrast is increased at the edges, and the effect is applied to the original image.If we take a sample image, I A , its sharpened form I S can be calculated as follows [47]: Here, h is the height and width of the image in pixels (since we have square-shaped images from the image resize operation), λ is a factor that adjusts the intensity of sharpening, and ∇(h, h) is a suitably-defined gradient at (h, h) [48].One of the widely used gradient functions is the discrete Laplacian operator [49], which can be defined as follows: We can get I S using Equations ( 2) and ( 3).The resultant image is shaped by three parameters of the operation-amount, radius, and threshold.Amount refers to the intensity of sharpness.Radius governs the area around the edges that are affected by the sharpening operation.A large value of the radius will sharpen a wider region around the edges, and a smaller value will do so to a narrower region.Threshold allows the algorithm to decide if a pixel will be considered an edge pixel and reduces the sharpening noise.In this study, we used [2, 1, 0.1] as the value of the amount, radius, and threshold, respectively.Figure 4b shows a sample retinal image (Figure 4a) sharpened using the unsharp masking technique.tensity of sharpness.Radius governs the area around the edges that are affected by the sharpening operation.A large value of the radius will sharpen a wider region around the edges, and a smaller value will do so to a narrower region.Threshold allows the algorithm to decide if a pixel will be considered an edge pixel and reduces the sharpening noise.In this study, we used [2, 1, 0.1] as the value of the amount, radius, and threshold, respectively.Figure 4b shows a sample retinal image (Figure 4a) sharpened using the unsharp masking technique.

Image Pre-Processing
We employed one more image processing technique to pre-process the retinal images before extracting features from the resultant sharpened images.We used Tone mapping, which is an imaging technique widely used in digital cameras, at this step.Tone mapping transforms a high dynamic range image into a low dynamic range image.Digital cameras use this operation to make the captured images more suitable for digital displays.However, we used this technique to compress the dynamic range of the retinal images and preserve the crucial information within them, such as local contrast, global contrast, details, etc., at the same time.Each tone of the original image is mapped to achieve a tone suitable for viewing.Tone mapping is carried out on a three-channel image ( ) using an operator defined as follows: where ℝ ⊆ ℝ, ⊂ ℝ , and = [0,255] [50].Tone mapping changes the luminance information of the original image while leaving the color information preserved.Equation (4) can be simplified as follows:

Image Pre-Processing
We employed one more image processing technique to pre-process the retinal images before extracting features from the resultant sharpened images.We used Tone mapping, which is an imaging technique widely used in digital cameras, at this step.Tone mapping transforms a high dynamic range image into a low dynamic range image.Digital cameras use this operation to make the captured images more suitable for digital displays.However, we used this technique to compress the dynamic range of the retinal images and preserve the crucial information within them, such as local contrast, global contrast, details, etc., at the same time.Each tone of the original image is mapped to achieve a tone suitable for viewing.Tone mapping is carried out on a three-channel image (I A ) using an operator defined as follows: where R i ⊆ R, D O ⊂ R i , and D O = [0, 255] [50].Tone mapping changes the luminance in- formation of the original image while leaving the color information preserved.Equation (4) can be simplified as follows: where L d and H d are the low dynamic range and high dynamic range luminance values of a pixel, respectively.∫ ∈ (0, 1] is called the saturation factor.R L , G L , and B L are the low dynamic ranges of red, green, and blue channels; and R H , G H , and B H are the high dynamic ranges of red, green, and blue channels, respectively.In this study, the global tone mapping system was used, where all the pixels are mapped with the same value of .
In the local tone mapping system, the mapping of each pixel depends on its neighboring pixels.can be applied to each pixel of a retinal image in three different ways: linear scaling, exponential mapping, and logarithmic mapping [50].In linear scaling, the main image is multiplied by a factor , which is defined as follows: where p i is a sample pixel.For logarithmic mapping and exponential mapping, the relationships between L d and H d are defined in Equations ( 7) and ( 8), respectively.
where k 1 ∈ (1, ∞] and k 2 ∈ [∞, 1) are two constants.Although their outcomes are visibly similar, we used the exponential mapping operation to approximate the retinal images' appearance in this study.Figure 4c presents a tone-mapped outlook of the image shown in Figure 4b.

Feature Set Creation
With the retinal images' pre-processing, we can now move on to the next step of the workflow: feature extraction.Figure 5 shows the tree channels' information of a sample processed retinal image.As seen from the figure, most of the blue channel pixels are black or near-black, indicating very little distinguishable information useful for the learning algorithm.The color images' blue channel can be omitted to lessen the number of features.However, as we plan to use a feature selection algorithm, later on, we are keeping the channel and the features extracted from it for now.We extracted two sets of features from each retinal image: histogram and GLCM.Below, we describe these two widely used imaging techniques.

Histogram Feature Extraction
The histogram of an image portrays the relative frequency of occurrence of the varying gray levels in that image [49].If we consider the range of the gray labels, a sample gray label image is (0 to − 1), which can be expressed as a discrete function as follows: where is the -th gray label, is the number of pixels with that particular gray label, and ℎ × is the number of total pixels of the image [51].Since we are working with 8-bit color images, in our case, the range of the gray labels is from 0 to 255. Figure 6 provides a histogram plot (not-normalized) of the image's three channels presented in Fig- ure 5a.As seen from the figure, most of the blue channel pixels reside within the gray labels ranging from 0 to 50, whereas the red and green channels contain a wide range of gray labels.

7000
Red channel

Histogram Feature Extraction
The histogram of an image portrays the relative frequency of occurrence of the varying gray levels in that image [49].If we consider the range of the gray labels, a sample gray label image is (0 to l − 1), which can be expressed as a discrete function as follows: where g k is the k-th gray label, n k is the number of pixels with that particular gray label, and h × w is the number of total pixels of the image [51].Since we are working with 8-bit color images, in our case, the range of the gray labels is from 0 to 255. Figure 6 provides a histogram plot (not-normalized) of the image's three channels presented in Figure 5a.As seen from the figure, most of the blue channel pixels reside within the gray labels ranging from 0 to 50, whereas the red and green channels contain a wide range of gray labels.
where is the -th gray label, is the number of pixels with that particular gray label, and ℎ × is the number of total pixels of the image [51].Since we are working with 8-bit color images, in our case, the range of the gray labels is from 0 to 255. Figure 6 provides a histogram plot (not-normalized) of the image's three channels presented in Fig- ure 5a.As seen from the figure, most of the blue channel pixels reside within the gray labels ranging from 0 to 50, whereas the red and green channels contain a wide range of gray labels.

Gray-Level Co-Occurrence Information Extraction
GLCM primarily highlights the texture information of an image.It can be expressed as a matrix function of the distance and angle between neighboring pixels.Haralick et al. worked extensively with different variants of GLCM of an image and other parameters that can be derived from it in the 1970s [52,53].The GLCM of a grayscale image is a measure of the frequencies at which different combinations of gray levels or pixel values occur in that image [54].We are working with 256 pixels × 256 pixels retinal images, where the value of each pixel lies in the range of 0-255.First, to calculate the GLCM of such an image, its pixel values are rescaled to distinct levels 1-8.The values can also be scaled to other or more discrete levels if necessary.However, in this study, levels 1-8 were used.This newly formed matrix is called a Scaled image (S).GLCM expresses how often a pixel value (k) occurs either diagonally, horizontally, or vertically to another pixel value (l).The co-occurrence matrix (C m ) of the image is specified with a displacement vector d = {(row, column)} shown in Figure 7. C m (k, l) points out how many times a k is separated from l by the displacement vector.Therefore, C m (k, l) can be defined as follows: Symmetry 2021, 13, x FOR PEER REVIEW 15 of 28 Pixel of interest (i, j) The resultant matrix is an 8 × 8 matrix containing the frequencies of different combinations of (k, l) in S [55].In this study, we chose to work with the 0 • [1, 0] displacement vector shown in Figure 7. Figure 8 shows the GLCM of the retinal image shown in Figure 5a.

Feature Concatenation
The above-described feature extraction techniques were applied to each channel of the retinal images.Histogram features calculated from the three channels were concatenated, which resulted in 768 features in total.GLCM features derived from each channel were flattened to consider them as a row vector of 64 features.Three GLCMs from three channels result in a total of 192 features.Combining these two sets of features, we obtained a set of 960 features extracted from a single retinal image.The process was repeated for all the 5285 samples to acquire the complete set of features used in this study.

Selection of the Most Relevant Features
The dimension of the training data is a crucial factor in machine learning.Although high-dimensional data contain more information about the corresponding sample, the samples become sparser and far apart from the other samples of the same class in a high-dimensional space, making it harder for learning algorithms to set decision boundaries.This phenomenon is known as the curse of dimensionality, and it has a profound effect on the performance of any machine learning algorithm [56].Feature selection methods offer a simple solution to this problem by identifying the most important features and omitting the rest.Various feature selection methods are available that work based on different principles.In this study, we choose to work with a DT-based GA.GAs

Feature Concatenation
The above-described feature extraction techniques were applied to each channel of the retinal images.Histogram features calculated from the three channels were concatenated, which resulted in 768 features in total.GLCM features derived from each channel were flattened to consider them as a row vector of 64 features.Three GLCMs from three channels result in a total of 192 features.Combining these two sets of features, we obtained a set of 960 features extracted from a single retinal image.The process was repeated for all the 5285 samples to acquire the complete set of features used in this study.

Selection of the Most Relevant Features
The dimension of the training data is a crucial factor in machine learning.Although high-dimensional data contain more information about the corresponding sample, the samples become sparser and far apart from the other samples of the same class in a highdimensional space, making it harder for learning algorithms to set decision boundaries.This phenomenon is known as the curse of dimensionality, and it has a profound effect on the performance of any machine learning algorithm [56].Feature selection methods offer a simple solution to this problem by identifying the most important features and omitting the rest.Various feature selection methods are available that work based on different principles.In this study, we choose to work with a DT-based GA.GAs are evolutionary algorithms; they work with candidate solutions and gradually evolve toward better outcomes [57].GAs are inherently inspired by the process of natural selection and have gotten most of their terminology from that branch of biology.
A typical GA starts with several random guesses, which is known as a population.The population usually spreads throughout the entire search space.A basic GA performs three primary operations to direct the population-selection, crossover, and mutation [58].A GA aims to converge toward the global optimum over a series of steps, which are known as generations.As the names suggest, these operators' functionalities are similar to the process of natural selection.Selection attempts to pressurize the population members so that betterperforming or "fitter" members are propagated toward the next generation, and weaker members are thrown out of the pool.This phenomenon strengthens the performance of the population of the next generation.Crossover allows multiple possible solutions to exchange information among them and create numerous subsets of new solutions, some of which might be better than the initial solutions.Mutation randomly alters values to see if that changes the outcome positively.Since this may cause radical changes in the outcome, the amount of mutation is typically kept very low.A new population emerges as a result of applying these operators to the previous population.This generation is again modified by the operators and propagated toward the next generation.This process is repeated until a predefined number of generations have elapsed or a certain criterion of convergence has been met, or if the solution remains unchanged for a certain number of generations.Algorithm 3 describes the working procedure of a basic GA [59].Step-1 : population of p s random individuals, P 0 Step-2 : for k = 0 to ∞ sum ← 0 Compute the fitness of the population: Pr k [j] ← {(j,)/sum Select and breed: for i = 1 to p s /2 select j 1 , j 2 ∈ P k based on Pr k j 1 , j 2 ← crossover(j 1 , j 2 )

DR Image Classification
XGBoost is a DT-based ensemble learning algorithm used for solving both classification and regression problems.XGBoost falls into the category of boosting algorithms as it improves its prediction power by training a set of weak learners and gradually converting them into strong ones.It was developed by Chen and Guestrin in 2014; XGBoost is an improvement upon the basic Gradient Boosting algorithm [60].There are quite a few advantages to using XGBoost.First of all, it is faster than the other boosting methods, since it builds trees parallelly, whereas the other boosting methods build trees sequentially, thus learning from the data at slower rates.Secondly, XGBoost has a built-in regularization technique to minimize overfitting [61].Thirdly, it uses an approximation algorithm that speeds up the model training process.Lastly, XGBoost can efficiently handle weighted and sparse data and supports out-of-core computing.As a result of these reasons, XGBoost has become a commonly used DT-based supervised learning algorithm.To describe how XGBoost works, let us consider a dataset, D = {(x i , y i )} such that |D| = n, x i ∈ R m , and y i ∈ R n which has m features and n examples.The model output of a boosting algorithm with T trees is defined as follows: where F = { f (x) = ω (x)} is a set of trees built to solve a classification task [62].Each f t divides a tree into two parts: leaf weight part (ω) and structure part ( ). f t can be learned by minimizing the following objective function: Here, is a training loss function measuring the distance between ŷi and y i , which are the prediction and the object, respectively.Ω represents the penalty of the model complexity.Traditional optimization methods cannot optimize a boosting algorithm with the objective function expressed in Equation ( 12) in Euclidean space.In the Gradient Boosting algorithm, the prediction and the objective function of the k-th iteration are defined as follows: XGBoost uses the second-order Taylor expansion to approximate Equation ( 14).The final objective function can be expressed as follows: where } i is the first-order gradient statistics and i is the second-order gradient statistics on the loss function.In XGBoost, Here, L is the number of leaves in the tree.Let us take I j = {i : (x i ) = j} as a set of instances.Now, by expanding Ω, Equation ( 15) can be simply rewritten as follows: For a tree structure, the solution weight ω * j of leaf j can be obtained from [63]: Combining Equations ( 17) and ( 18), we can write: Equation ( 19) can be used to evaluate the tree (x) and look for the most optimal tree structures.However, it is not possible to go through all of them.[60] outlined a greedy algorithm that is useful in this situation.It starts from a single leaf and attaches branches after each iteration to gradually grow the structure.Whether or not a particular split will be added to the existing structure is determined by the following function: where I = I l ∪ I r , and I l and I r are the instance sets of the left and right nodes, respectively, after the split.After a node split in the tree, that split is judged in terms of the change in the model's performance based on the objective function.If the performance has improved, the associated split will be adopted; otherwise, it will be stopped.Furthermore, while optimizing the objective function, XGBoost usually faces less overfitting than other boosting techniques because of this regularization.

Results and Discussion
In this section, we present the results acquired by the described classification model.As stated in the previous section, we used XGBoost as our main classification algorithm.In the following subsection, we present the outcomes based on the combined feature set.In the following subsection, we demonstrate the results based on the GA's selected features only.The reason behind presenting them separately is to portray how the method's performance changes if a smaller number of features are used to represent the retinal images to the classifier.
We used 75% of the data (assigned after shuffling the samples) to train the supervised learning algorithm and the remaining 25% to test it.As a non-linear learning algorithm, XGBoost is often prone to overfitting, even though it has a built-in regularization technique to minimize it.Overfitting occurs when the learning algorithm develops hypotheses to fit the training data and becomes too specific for the test data.As a result, the model can classify the training subset's samples with a high degree of accuracy but exhibits considerably poor performance on the testing subset's samples.To build an ideal model, we need to reduce overfitting as much as possible.If properly tuned, several parameters of the XGBoost algorithm can help to mitigate overfitting by controlling the model's complexity, increasing randomness while training, and making the model more vigorous to the noise present in the data [64].In this study, we took a heuristic approach and varied a few parameters to obtain their best values.These parameters include the number of boosting stages to perform (n_estimators), the maximum depth of a tree (max_depth), the minimum sum of instance weight required in a child (min_child_weight), the subsample ratio of the training instances (subsample), the random seed given to each estimator at each boosting iteration (random_state), and the rate of learning from training data (learning_rate).Similar XGBoost hyperparameter tuning has been done by the authors of the study [65].
Figure 9 presents the outcomes of these exhaustive operations carried out to find the most suitable configuration for our application.These 3D graphs show the accuracy of a particular testing subset of data at different max_depth, n_estimators, and learning_rate values.Parameters such as min_child_weight and random_state were kept constant while carrying out these classifications.As the figure illustrates, XGBoost achieved accuracy scores lower than 94% when the trees' maximum depth was restricted to below 5.However, the accuracy scores reached over 94%, while max_depth was set to 5 or 6.These points have been marked with red in the graphs.As we can see, there are several marked areas in Figure 9c,d.The values assigned to these parameters for further classifications have been listed in Table 3.

Classification Outcomes Based on the Entire Feature Set
After hyper-tuning the classification algorithm for the prepared dataset, we moved on to the main classification operation.We performed 10-fold cross-validation on the dataset to claim a more robust classification outcome.The 75%-25% train-test split was

Classification Outcomes Based on the Entire Feature Set
After hyper-tuning the classification algorithm for the prepared dataset, we moved on to the main classification operation.We performed 10-fold cross-validation on the dataset to claim a more robust classification outcome.The 75-25% train-test split was maintained in each fold.Figure 10a presents the tuned XGBoost model's classification outcomes on ten different folds of the prepared dataset regarding classification accuracy.For clarity, we have presented the results on individual feature sets and the combined set to show the increase of the model's performance while both sets of features are provided.The average accuracy of the ten classifications while using only the histogram and GLCM features are 91.13% and 82.52%, respectively.The score reaches an average of 94.20% (95% confidence interval: 93.88-94.51%)while the features were combined, which is over 3% higher than when only the histogram features were used.
presented in Figure 10a.This indicates that the model is not too biased because of the imbalance of the dataset.However, the F-measure scores are a bit lower than the accuracy scores.On average, the scores are 90.92%,80.11%, and 93.51% based on the histogram, GLCM, and combined features, respectively.Table 4 presents these fold-wise results in numeric forms, along with the precision, recall, and Quadratic Weighted Kappa (QWK) scores of the classifications.All the scores provided in this study were achieved on the corresponding testing subsets.Figure 11a presents the aggregated confusion matrix of the ten classifications.This matrix shows the number of correctly and wrongly classified samples along with their actual and predicted labels.This representation allows us to take a deeper look at the classification model's strengths and weaknesses.As seen from the figure, most of the samples of the NO, MI, SE, and PR classes were classified correctly.However, the model faced some challenges in accurately identifying the MO class samples; thus, it misclassified almost 30% of its samples.The issue is reflecting in the class's receiver operating characteristics (ROC) curve as well, as presented in Figure 11b.As seen from the figure, the MO curve is the furthest away from the top-left corner of the graph, which indicates that the model has the least success rate in this class [66].The other four classes' curves are very close to the corner with approximate AUC scores of 1.00 each.The presented ROC curve has been constructed from the classification outcomes of the 8th fold, since its Figure 10b provides the F-measure scores of each fold of the 10-fold cross-validation operation.F-measure is a performance measurement score that takes both the precision and recall scores of the classification into account.More importantly, F-measure is not affected by the uneven class sample sizes of the data, which is the case here.As the figure illustrates, the F-measure bars reached heights similar to the corresponding accuracy bars presented in Figure 10a.This indicates that the model is not too biased because of the imbalance of the dataset.However, the F-measure scores are a bit lower than the accuracy scores.On average, the scores are 90.92%,80.11%, and 93.51% based on the histogram, GLCM, and combined features, respectively.Table 4 presents these fold-wise results in numeric forms, along with the precision, recall, and Quadratic Weighted Kappa (QWK) scores of the classifications.All the scores provided in this study were achieved on the corresponding testing subsets.Figure 11a presents the aggregated confusion matrix of the ten classifications.This matrix shows the number of correctly and wrongly classified samples along with their actual and predicted labels.This representation allows us to take a deeper look at the classification model's strengths and weaknesses.As seen from the figure, most of the samples of the NO, MI, SE, and PR classes were classified correctly.However, the model faced some challenges in accurately identifying the MO class samples; thus, it misclassified almost 30% of its samples.The issue is reflecting in the class's receiver operating characteristics (ROC) curve as well, as presented in Figure 11b.As seen from the figure, the MO curve is the furthest away from the top-left corner of the graph, which indicates that the model has the least success rate in this class [66].The other four classes' curves are very close to the corner with approximate AUC scores of 1.00 each.The presented ROC curve has been constructed from the classification outcomes of the 8th fold, since its accuracy score is the closest to the average.Table 5 lists the method's class-wise performance in terms of their individual specificity, precision, recall, and F-measure scores.As seen from the table, apart from the MO class, we achieved an F-measure score of over 95% in each category.The MO class suffers from poor recall, which has been reflected in the previous results as well.

Classification Outcomes Based on the Selected Features
As mentioned earlier, selecting the most powerful features and using them to train the learning algorithm can significantly reduce the amount of training data, which, in turn, reduces the model's time and space complexity.After concatenating the histogram

Classification Outcomes Based on the Selected Features
As mentioned earlier, selecting the most powerful features and using them to train the learning algorithm can significantly reduce the amount of training data, which, in turn, reduces the model's time and space complexity.After concatenating the histogram and the GLCM features, we acquired a set of 960 features for each sample.As seen in the previous subsection, the XGBoost algorithm is well capable of working with the entire feature set and provides very good classification performance.However, we employed a GA to reduce the number of features further.The associated parameters of the GA are listed in Table 6.Since we performed 10-fold cross-validation, we had different training and testing sets in each fold.GA requires the corresponding class labels of the samples to calculate the fitness of each generation.We selected each fold's best features by providing the fold's training samples and their labels to the GA.Hence, each fold's GA output was a set of features best fit to classify that particular fold's training samples.Figure 12a presents the accuracy and F-measure scores of classifications using XGBoost on the testing subsets of the ten different folds based on the feature sets selected by GA (in each fold).It also provides the number of features GA selected in each fold above the corresponding bars.As seen from Figures 10 and 12a, the latest classification outcomes are very similar to those achieved based on the combined set of features.However, since we have disregarded the majority of the features, the scores fell slightly.The average accuracy of the ten folds is 93.30% (margin of error: ±0.272%), and the F-measure is 92.38% (margin of error: ±0.258), which is just 0.9% and 1.13% less than the respective scores achieved based on the combined set of features.The samples of the folds, as well as the training and testing subsets, were not changed during these experiments.However, these results were acquired by using fold-dependent feature sets selected by GA.

DR Classification Performace Comparison
We have compared the proposed method's performance with other reported methods in Table 1.The table clearly shows that our method outperforms most of the previously reported methods by some margins.[33] and [34] are the only two studies describing two deep learning-based methods that achieved over 90% classification accuracy.However, [33] has poor precision and recall scores, which affects the F-measure.As discussed in Section 2.1, the popular deep learning methods used by other similar studies did not provide very good classification outcomes on this dataset.At the early stage of our study, we too experimented with some popular deep learning models to categorize the samples of the dataset.However, none of them yielded good and stable classification outcomes, which motivated us to look for alternative learning methods in the first place.
The described method incorporates several steps of image exclusion and pre-processing operations.These operations were performed to reduce the level of noise present in the retinal images and make the samples more classifiable at the learning stage.We present the classification results without some of those operations in Table 7.All the other associated parameters of the learning method were kept precisely the same.However, the samples were re-shuffled during some of those classifications, so the folds' training and testing subsets may not contain the same samples.As seen from the table, on To unify these sets, we performed a basic set union operation on the acquired (ten) sets of selected features.This joined set of features, which we will call the Union set of selected features, contains all the feature indexes selected by GA over the ten folds without reparation.We performed a further classification of the retinal samples using the Union set of selected features, which contains 669 features in total.The results of the fold-wise classifications have been presented in Figure 12b.Our classifier of choice, XGBoost, in its described setting (Table 3) categorized the samples of the dataset with a 93.70% classification accuracy (margin of error: ±0.222%) and a 92.38%F-measure (margin of error: ±0.258%).In comparison, that is 0.5% and 1.13% lower than the average accuracy and F-measure scores acquired based on the combined set of features, which is negligible considering that the Union set of selected features contains 30% fewer features than that of the combined set.Different feature selection outcomes can be achieved by varying the employed GA's parameters or using another feature selection method.We intend to explore multiple feature selection algorithms in our future studies.

DR Classification Performace Comparison
We have compared the proposed method's performance with other reported methods in Table 1.The table clearly shows that our method outperforms most of the previously reported methods by some margins.[33] and [34] are the only two studies describing two deep learning-based methods that achieved over 90% classification accuracy.However, [33] has poor precision and recall scores, which affects the F-measure.As discussed in Section 2.1, the popular deep learning methods used by other similar studies did not provide very good classification outcomes on this dataset.At the early stage of our study, we too experimented with some popular deep learning models to categorize the samples of the dataset.However, none of them yielded good and stable classification outcomes, which motivated us to look for alternative learning methods in the first place.
The described method incorporates several steps of image exclusion and pre-processing operations.These operations were performed to reduce the level of noise present in the retinal images and make the samples more classifiable at the learning stage.We present the classification results without some of those operations in Table 7.All the other associated parameters of the learning method were kept precisely the same.However, the samples were re-shuffled during some of those classifications, so the folds' training and testing subsets may not contain the same samples.As seen from the table, on average, the classification model's accuracy degrades by 2.4%, 2.9%, and 0.9% without image exclusion, unsharp masking, and Tone mapping, respectively.Similar outcomes can be observed in the F-measure scores as well.Based on these results, it can be concluded that these pre-processing operations have elevated the overall performance of the presented DR severity classification method.

Conclusions and Future Work
This article described a novel ensemble learning-based method to determine DR's severity according to the ICDRSS standard.We tested our proposed method on the APTOS 2019 BD dataset.First of all, we had to deal with a few dataset issues, including excessively noisy images, duplicate images with improper labeling, uneven image resolution, and varying class sample sizes.After that, we applied some image processing techniques to prepare the images for feature extraction, which was done later by calculating each retinal image's histogram and GLCM.Then, we hyper-tuned the XGBoost algorithm to provide the best performance on our created feature set.Lastly, we employed a GA to single out the most important features for classification and showed that doing so does not significantly drop the method's performance.The method offers excellent performance while identifying the samples of four of the five classes.The method provides an average classification accuracy of 94.20% (95% confidence interval: 93.88-94.51%)based on the entire feature set and 93.70% (95% confidence interval: 93.48-93.93%)based on the selected features.Other evaluation matrices presented in the article support the acquired results

Figure 1 .
Figure 1.The entire workflow of the proposed Diabetic Retinopathy (DR) classification method.

Figure 1 .
Figure 1.The entire workflow of the proposed Diabetic Retinopathy (DR) classification method.

Algorithm 1 :
An algorithm to detect duplicate images of the same size Input Output : : Two-color images, I A and I B .

Symmetry 2021 , 28 Algorithm 2 :
13, x FOR PEER REVIEW 9 of An algorithm for image crop Input Output : : A retinal image, A cropped image without black borders, Step-1 : Determine the resolution of and .(ℎ, ) ← height & width of (in pixel).

Figure 2 .
Figure 2. (a) A DR image with unwanted black borders on all four sides, (b) output of the described crop operation on (a), (c) an image with black edges only on sides, and (d) output of the crop operation on (c).

Figure 2 . 1 :
Figure 2. (a) A DR image with unwanted black borders on all four sides, (b) output of the described crop operation on (a), (c) an image with black edges only on sides, and (d) output of the crop operation on (c).

Symmetry 2021 , 28 Figure 3 .
Figure 3. (a) A DR image with different numbers of pixels in height and width and (b) output of the resize operation on (a) depicting that it has been compressed along its width to make it a square-shaped image.2.2.5.Dealing with Uneven Sample Sizes Table 2 presents the number of images of five different classes present in the APTOS 2019 BD dataset.As seen from the table, the dataset is imbalanced, or in other words, the associated classes do not have the same number of samples.The issue persists after removing the excessively noisy images and duplicate images.At this point, the sample size of the NO class (the majority class) is almost eight times larger than that of the Severe DR (SE) class (the minority class).Compared to No DR (NO), Mild DR (MI) and Prolific DR (PR) have significantly smaller sample sizes as well.There are several disadvantages to working with such an imbalanced dataset.Bias in decisions toward the majority class while judging new samples is the most acute one.To deal with this problem, we augmented the existing samples of the MI, SE, and PR classes to create new samples.Each image of the mentioned classes was rotated by 90°, 180°, and 270° to obtain sample sizes four times larger than the previous ones.As shown in Table2, this operation does not solve the issue entirely but reduces it to some extent.After augmentation, we have 5285

Figure 3 .
Figure 3. (a) A DR image with different numbers of pixels in height and width and (b) output of the resize operation on (a) depicting that it has been compressed along its width to make it a square-shaped image.2.2.5.Dealing with Uneven Sample Sizes Table 2 presents the number of images of five different classes present in the APTOS 2019 BD dataset.As seen from the table, the dataset is imbalanced, or in other words, the associated classes do not have the same number of samples.The issue persists after removing the excessively noisy images and duplicate images.At this point, the sample size of the NO class (the majority class) is almost eight times larger than that of the Severe DR (SE) class (the minority class).Compared to No DR (NO), Mild DR (MI) and Prolific DR (PR) have significantly smaller sample sizes as well.There are several disadvantages to working with such an imbalanced dataset.Bias in decisions toward the majority class while judging new samples is the most acute one.To deal with this problem, we augmented

Figure 4 .
Figure 4. (a) A sample DR image, (b) contrast-enhanced version of (a), and (c) the tone-mapped version of (b).

Figure 4 .
Figure 4. (a) A sample DR image, (b) contrast-enhanced version of (a), and (c) the tone-mapped version of (b).

28 Figure 5 .
Figure 5. (a) A processed DR image, (b) the red channel of (a), (c) the green channel of (a), and (d) the blue channel of (a) containing very little information.

Figure 5 .
Figure 5. (a) A processed DR image, (b) the red channel of (a), (c) the green channel of (a), and (d) the blue channel of (a) containing very little information.

Figure 6 .
Figure 6.The histogram of the processed retinal image is shown in Figure 5a.

Figure 8 .
Figure 8.The GLCM of the processed retinal image is shown in Figure 5a.

Figure 8 .
Figure 8.The GLCM of the processed retinal image is shown in Figure 5a.

Algorithm 3 :
Steps of feature selection using basic GA Input Output : : A set of features and labels, = < x 1 , y 1 >, . . .< x n , y n > Fitness function, { (•,•) Fitness threshold, τ Population size, p s A set of strong (the fittest) features

Figure 10 .
Figure 10.Comparison of the classification outcomes at different folds in terms of (a) accuracy and (b) F-measure scores based on different subsets of the prepared feature set.

Figure 10 .
Figure 10.Comparison of the classification outcomes at different folds in terms of (a) accuracy and (b) F-measure scores based on different subsets of the prepared feature set.

Figure 11 .
Figure 11.(a) Aggregated confusion matrix and (b) the receiver operating characteristics (ROC) curves of classification performance of the 8th fold.

Figure 11 .
Figure 11.(a) Aggregated confusion matrix and (b) the receiver operating characteristics (ROC) curves of classification performance of the 8th fold.

Figure 12 .
Figure 12.Classification outcomes at different folds based on (a) fold-wise selected features and (b) the Union set of selected features.

Figure 12 .
Figure 12.Classification outcomes at different folds based on (a) fold-wise selected features and (b) the Union set of selected features.

Figure A1 .
Figure A1.Sample images of the five DR classes collected from the APTOS 2019 BD dataset along with their class labels.

Figure
FigureA2presents a few samples excluded from the study in order to illustrate the extent of their decay.

Figure A1 .
Figure A1.Sample images of the five DR classes collected from the APTOS 2019 BD dataset along with their class labels.

Figure
FigureA2presents a few samples excluded from the study in order to illustrate the extent of their decay.

Figure A1 .
Figure A1.Sample images of the five DR classes collected from the APTOS 2019 BD dataset along with their class labels.

Figure
FigureA2presents a few samples excluded from the study in order to illustrate the extent of their decay.

Figure A2 .
Figure A2.Sample images excluded from the study due to excessive decay.

Figure A2 .
Figure A2.Sample images excluded from the study due to excessive decay.
and Qidwai outlined a et al. extracted deep features from the fundus images of the dataset and used

Table 1 .
DR classification performance comparison on the Asia Pacific Tele-Ophthalmology Society 2019 Blindness Detection (APTOS 2019 BD) dataset.

Table 1 .
DR classification performance comparison on the Asia Pacific Tele-Ophthalmolog tection (APTOS 2019 BD) dataset.

Table 1 .
DR classification performance comparison on the Asia Pacific Tele-Ophthalmology Society 2019 Blindness Detection (APTOS 2019 BD) dataset.

Table 2 .
The number (and percentage) of images of each class remaining after various steps of the workflow.

Table 3 .
Optimized values of the XGBoost parameters.

Table 5 .
Class-wise classification performance evaluation.

Table 5 .
Class-wise classification performance evaluation.

Table 6 .
Various parameters of the employed Genetic Algorithm (GA).

Table 7 .
Performance of the classifier with and without specific steps of the described method.