Severity Grading and Early Retinopathy Lesion Detection through Hybrid Inception-ResNet Architecture

Diabetic retinopathy (DR) is a complication of diabetes that impairs human vision. It begins with damage to the blood vessels of the light-sensitive tissue of the retina. In the beginning, DR may show no symptoms or only slight vision issues, but in the long run it can cause permanent vision impairment, including blindness, in advanced as well as developing nations. This could be prevented if DR is identified early enough, but early identification is challenging because the disease frequently shows few signs until it is too late to deliver an effective cure. In this work, we propose a framework for severity grading and early DR detection through a hybrid deep learning Inception-ResNet architecture with smart data preprocessing. Our proposed method is composed of three steps. Firstly, the retinal images are preprocessed with augmentation and intensity normalization. Secondly, the preprocessed images are given to the hybrid Inception-ResNet architecture, which extracts image feature vectors for categorizing the different stages. Lastly, a classification step identifies DR and decides its stage (e.g., mild DR, moderate DR, severe DR, or proliferative DR). Our experiments show favorable outcomes when compared with previously deployed approaches. However, our study has specific constraints, which are also discussed, and we suggest ways to advance further research in this field.


Introduction
According to data from the World Health Organization (WHO), 5 to 7 million people across the globe face vision impairment caused by diabetic retinopathy (DR), which accounts for around 5-6% of world blindness, as shown in Figure 1. Timely detection can avert the danger of vision loss.
Automated categorization of cardiovascular and ophthalmologic infections through analysis of fundus images has become a well-known exercise in the field of telemedicine. Previous methods relied on manual separation, which is tiresome, time-consuming, and difficult, and for which skilled manpower is mandatory [1]. On the other hand, computer-aided identification of fundus irregularities is economical, realistic, and impartial, and does not need professionally trained ophthalmologists to categorize the fundus images. Improvement in the screening methods is useful in early detection and real-time grading of fundus diseases such as retinitis pigmentosa (RP), diabetic retinopathy (DR), [...] bunker, age-related macular degeneration (AMD), retinoblastoma, and retinal detachment [3]. A number of template-related, edge-based, and morphological techniques have been used in recent years for autodetection of fundus developm[...] fundus pathology [4]. Furthermore, numerous unsupervised as well as supervised neural network (NN)-based approaches have also been used for fundus image examination. Numerous supervised approaches have adopted artificial neural networks (ANN), SVM, decision trees (DT), and multilayer perceptrons (MLP) [5]. Moreover, filter matching and model-based methods have also been analyzed for unsupervised abnormality discovery [4,6,7,8].
At present, identifying diabetic retinopathy (DR) is a time-intensive, manual procedure that requires a qualified and skilled ophthalmologist to examine and assess color fundus images of the retina. The human grader typically submits a review report a day or two later, and this delay can lead to delayed follow-up, misinterpretation, or delayed treatment. An ophthalmologist detects DR from the occurrence of lesions linked with the vascular anomalies initiated by the disease. Although this technique is effective, it demands considerable resources. The skill and apparatus needed are often missing in regions with a high rate of diabetes among residents, where detection of DR is most needed. As the number of diabetes cases increases, the facilities required to avert vision loss due to diabetic retinopathy will become ever more inadequate.
Due to the lack of physical infrastructure and skilled human resources, the need for an automated, state-of-the-art technique for DR detection has long been recognized, and earlier efforts have shown good progress using image classification, pattern recognition, and machine learning (ML). Taking digital color fundus images as input, the objective of our work is to establish an automatic computerized system for the detection of DR, preferably one with realistic on-ground potential. DR severity grades are determined by the lesions identified on the retina, such as microaneurysms, hard exudates, and hemorrhages [10]. The literature contains various techniques, including deep learning (DL) methods and models built for the identification of lesions and classification of DR, with encouraging results. Khojasth et al. [1] developed a 12-layer CNN to identify three different lesions associated with diabetic retinopathy (DR), i.e., microaneurysms, hard exudates, and hemorrhages. Lam et al. [11] localized lesions with the help of various CNN architectures, i.e., ResNet, AlexNet, VGG-16, InceptionV3, and GoogLeNet. Eftekhari et al. [4] developed a 10-layer CNN to identify microaneurysm lesions. Gersia et al. [6] built a model for the identification of DR based on data preprocessing followed by a comparison of the results of various CNN models such as VGGNet16 and AlexNet. Yung et al. [5] built a CNN model to categorize levels of DR from normal to severe grade. Pratt et al. [2] built a CNN model for grading DR with the help of data augmentation and data preprocessing. Qummar et al. [12] used various preprocessing stages and a combination of different CNN architectures, i.e., ResNet50, Inception-V3, Xception, and DenseNet, to categorize DR into appropriate levels.
Kori et al. [13] used preprocessing followed by a model composed of five ResNets and three different DenseNets to grade DR levels. These models have attained high accuracy. Although the developed DL models have shown impressive results in the grading of DR, additional work is required to improve the accuracy of grading. Figure 2 represents the various stages of DR.

Related Work
In this section, we first review various DR classification techniques. Because the dataset used in our experiments is extremely class-imbalanced, which has a large influence on the outcomes of the models, we also survey techniques that address this issue. Finally, we discuss the two architectures that directly inspire our model.
As microaneurysms and bleedings are usually the initial indications of diabetic retinopathy, numerous studies have been conducted on these lesions, mainly for the early identification of DR. Therefore, we emphasize those methods in our study. Numerous works apply traditional image processing techniques, and several of them are summarized in Table 1. We have restricted this table to the techniques that identify microaneurysms and bleeding, as their occurrence is important for the early recognition of diabetic retinopathy (DR).

Earlier DR Classification Techniques
Studies on automated DR grading have been a vigorous domain of medical image processing in recent decades [21]. Many scholars have suggested various techniques to categorize diabetic retinopathy (DR). These techniques can be largely divided into three classes according to the grading measure: the first is binary, i.e., with or without diabetic retinopathy; the second is a 3-stage scheme of normal, non-proliferative, or proliferative diabetic retinopathy; the most recent and most broadly used measure is the 5-class arrangement described in the previous section. Figure 3 represents the color fundus image with DR.
For the binary class, Gardnar et al. [7] used pixel intensities as input features of the model and achieved specificity (Sp) and sensitivity (Se) of 83.5% and 88.4%, respectively, on a comparatively small dataset of a few hundred photographs. Roychowdhury et al. [22] suggested a 2-step tiered classification method that combines four machine learning (ML) techniques, namely SVM, Gaussian mixture models (GMMs), AdaBoost, and k-nearest neighbors (kNNs), and obtained 100% sensitivity, 0.904 AUC, and 53.16% specificity. Prieya et al. classified retinal photographs with diabetic retinopathy into NPDR or PDR by first extracting blood vessel features and features of hemorrhages as well as exudates. These features were then fed into three grading techniques, SVM, a probabilistic neural net (PNN), and a Bayesian classifier (BC), achieving accuracies of 97.6%, 89.6%, and 94.4%, respectively.
For 3-stage classification, Nayek et al. [23] used features such as the area of blood vessels and of exudates together with texture, and fed these features into a neural network (NN). On a very small dataset of 140 photographs, they obtained an accuracy of 93%, with specificity (Sp) and sensitivity (Se) of 100% and 90%, respectively. Most of the modern techniques focus on 5-class classification. Achariya et al. [24] used higher-order spectra techniques to obtain features and classified photographs with the help of an SVM. This technique achieved accuracy, specificity, and sensitivity of 82%, 88%, and 82%, respectively. Adarish et al. [25] used texture features and the area of the infected part of the retina to train multiclass support vector machines (SVMs) for classification. With the emergence of deep learning (DL) in recent years, Pratt et al. [2] proposed a model to categorize DR with the help of a 13-layer convolutional net and assessed this network on a large dataset from Kaggle.

Class-Imbalance Feature Learning
Imbalanced datasets are common in real-world domains, e.g., identifying unreliable telecommunication clients, classifying textual data, and retrieving information from medical imaging [3]. The majority of classification algorithms use an objective function based on the 0-1 loss and a regularized loss function. Unless the imbalance is addressed, most techniques tend to be biased toward the majority classes and have poor accuracy on the smaller classes [8].
Many solutions for the class imbalance problem have been suggested at the data level [3]. Our proposed solution, however, addresses class imbalance not only at the data level but at the algorithm level as well. At the data level, the proposed solution uses various types of resampling, such as undersampling, oversampling, and their combination. At the algorithm level, a cost-sensitive technique is used, in which every class receives a different weight when the loss function is calculated. In [26], the researchers trained a CNN model with a cost-sensitive loss function to perform saliency identification.
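The cost-sensitive idea described above can be sketched as follows. This is a minimal illustration, not the weighting actually used in this work: the inverse-frequency rule and the function names `inverse_frequency_weights` and `weighted_cross_entropy` are assumptions introduced here for clarity.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Assumed weighting rule: weight_c = N / (num_classes * count_c)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    total = counts.sum()
    # Avoid division by zero for classes that do not occur at all.
    safe_counts = np.where(counts > 0, counts, 1.0)
    return total / (num_classes * safe_counts)

def weighted_cross_entropy(probs, labels, class_weights):
    """Cross-entropy in which each sample is scaled by its class weight."""
    eps = 1e-12
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    w = class_weights[labels]
    return float(np.sum(-w * np.log(picked + eps)) / np.sum(w))
```

With three images of class 0 and one of class 1, the rare class receives three times the weight of the common one, so errors on it contribute more to the loss.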

ResNet and Deeply Supervised Nets
In recent years, deep learning (DL) has gradually become widespread in both industrial and academic domains. Areas such as computer vision, pattern recognition, and natural language processing (NLP) have benefited from the considerable power of deep neural networks (DNNs).
In [27], the researchers introduced companion objective functions at every hidden layer to address three problems present in the conventional CNN model: first, the transparency of the intermediate layers to the overall classification; second, the robustness and discriminativeness of the learned features, particularly in the early layers; and third, the effectiveness of training in the presence of vanishing gradients. The idea of a deeply supervised network has been effectively applied to many computer vision problems, e.g., saliency, scene text, and edge identification. In [28], the researchers presented the ResNet architecture, which eases the training of very deep neural networks. This model explicitly reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
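The residual formulation can be illustrated with a toy block that computes y = ReLU(F(x) + x). This is a hedged sketch only: it uses two dense layers for F instead of convolutions, and the names `residual_block`, `w1`, and `w2` are placeholders introduced here.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F(x) = relu(x @ w1) @ w2 is the residual branch."""
    residual = relu(x @ w1) @ w2
    # The identity shortcut adds the input back, so the layers only
    # need to learn the difference between the output and the input.
    return relu(residual + x)
```

When the weights of the residual branch are zero, the block reduces to the identity followed by ReLU, which is why very deep stacks of such blocks remain easy to optimize.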

Methodology
As illustrated in Figure 4, the proposed method comprises three stages: preprocessing, in which intensity normalization is applied, followed by data augmentation and data balancing; feature extraction through the hybrid Inception-ResNet architecture; and classification through a neural network (NN) classifier.

Dataset Collection
Data collection is a significant stage that is often undervalued. The standard of the data given as an input to the system has a robust effect on the resultant performance of the proposed ML model. Thus, it is essential to thoroughly examine the available dataset and consider all potential problems that should be sorted out before going into the modeling phase.
The data are collected from online resources such as the IDRiD or Kaggle dataset. IDRiD is the Indian Diabetic Retinopathy (DR) Image Dataset. The dataset comprises 3662 labeled fundus images and 1928 unlabeled test images of clinical patients, with five severity level labels, i.e., normal, mild DR, moderate DR, severe DR, and proliferative DR (PDR). Figure 5 shows that the data are not balanced: 49% of the images belong to patients with no fundus disease, and the other 51% represent the various stages of diabetic retinopathy. Class 3 (severe) is the least common class, with only five percent of all images.
The dataset was gathered from several sources (clinics) using several models of digital cameras, which produces differences in image resolution, width-to-height aspect ratio, and other properties. Figure 6 shows the aspect ratio of width and height of the images.

Image Preprocessing
To streamline the classification process for the suggested model, it is essential to make sure that all the fundus images look alike.
First, because the images come from various cameras with dissimilar output aspect ratios, some fundus images have large black areas around the fundus. These black regions hold no information relevant to prediction; hence, they need to be cropped. However, the proportion of black regions differs from image to image because of the different camera sources. To sort out this problem, we design a special function that converts the image to grayscale and detects the black regions from the intensity level of the pixels. We then build a mask by selecting the rows and columns in which the pixels surpass an intensity threshold. In this way, we remove the horizontal and vertical black strips, such as the ones detected in the top right of the image. Lastly, after eliminating the black strips, all the images are resized to the same width and height.
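The border-cropping step can be sketched roughly as below. This is an illustrative assumption, not the exact function used in this work: grayscale conversion is approximated by a channel mean, the threshold value of 10 is arbitrary, and the name `crop_black_borders` is introduced here. Resizing to a common width and height is left out.

```python
import numpy as np

def crop_black_borders(image, threshold=10):
    """Remove rows and columns in which no pixel exceeds the intensity threshold."""
    # Simple grayscale: average over the color channels (an assumed conversion).
    gray = image.mean(axis=2) if image.ndim == 3 else image
    mask = gray > threshold
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    if rows.size == 0 or cols.size == 0:
        return image  # entirely dark image: nothing to crop against
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

The function keeps the bounding box of all bright rows and columns, so dark areas inside the fundus itself are preserved.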
The second problem is the shape of the eye. Depending on the structure of the image, some eyes appear circular, while others appear oval. Because the shape and size of what is positioned in the retina decide the severity level of the disease, it is essential to normalize the shape of the eye as well. For this purpose, we design an additional function that crops the image from the center in a circular shape.
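A circular center crop of this kind might look like the following sketch. The choice of a centered square with an inscribed circle and the name `circular_crop` are assumptions made for illustration.

```python
import numpy as np

def circular_crop(image):
    """Crop the central square and zero out everything outside its inscribed circle."""
    h, w = image.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    square = image[top:top + side, left:left + side].copy()
    yy, xx = np.ogrid[:side, :side]
    center = (side - 1) / 2.0
    radius = side / 2.0
    outside = (yy - center) ** 2 + (xx - center) ** 2 > radius ** 2
    square[outside] = 0  # pixels outside the circle become black
    return square
```

After this step every image has the same circular field of view, regardless of whether the original fundus looked round or oval.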
Lastly, we apply a Gaussian filter to smooth the images and correct differences in brightness and lighting.
In addition to the steps above, image preprocessing is carried out in three steps: normalization (discussed above in detail), image augmentation, and data balancing (see Figure 7). First, the image intensities are normalized to the range between 0 and 1. After normalization, augmentation is used to increase the amount of training data in order to improve the quality of training. This is achieved by flipping every image about the y-axis. The suggested preprocessing phases (normalization and augmentation of data) are illustrated in Figure 7.
Lastly, to train the proposed model effectively, the data must be balanced, and the balanced data are provided to the proposed CNN model during training. In this step, the data are split equally across the stages of DR, which helps to eliminate bias during training. Hence, the network is trained with an equal number of fundus photographs, chosen randomly, for every grade of DR.
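The three preprocessing steps can be sketched as follows. Several details are assumptions, since the text does not fix them: min-max scaling is assumed as the normalization to [0, 1], the flip is implemented as a left-right mirror, balancing is done by random undersampling to the size of the smallest class, and all function names are introduced here.

```python
import numpy as np

def min_max_normalize(image):
    """Scale intensities to [0, 1] (assumed min-max form)."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        return np.zeros_like(image)  # constant image: map everything to 0
    return (image - lo) / (hi - lo)

def flip_about_y_axis(image):
    """Mirror the image left to right."""
    return image[:, ::-1]

def balanced_indices(labels, rng):
    """Randomly pick the same number of samples for every grade."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n = counts.min()
    picks = [rng.choice(np.where(labels == c)[0], size=n, replace=False)
             for c in classes]
    return np.sort(np.concatenate(picks))
```

Undersampling discards images from the larger classes; oversampling the smaller classes instead would keep all data at the cost of repeated samples.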

Deep Learning Inception-ResNet Model
The proposed model comprises three phases, which make this system efficient compared with the other state-of-the-art techniques discussed in this study:
• Pretraining: As is common in medical imaging, the dataset is limited in size (N = 3662). Therefore, using the transfer learning technique, we train our model twice, first on the larger ImageNet dataset. Although ImageNet images differ from images of the retina, ImageNet helps the network learn edges and shapes in the first phase. To teach the model the target domain of retinopathy, in the second phase we further train it on a larger dataset featuring around 35,000 retina images.
• Fine-tuning: After training the model on ImageNet and the larger retinal dataset, we fine-tune the proposed model on our limited target image set. We make modelling decisions based on the out-of-fold predictions.
• Inference: We combine the predictions of the models trained on various arrangements of training folds and also use test-time augmentation to further enhance performance.
In summary, we first initialize our model with ImageNet weights, then train it on the larger target-domain dataset, and lastly fine-tune it on the smaller dataset. Results show that our proposed model outperforms the other latest models on smaller datasets.
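The inference phase can be sketched as below, assuming each fold model is a callable that maps an image to a vector of class probabilities. The name `predict_with_tta`, the use of a single left-right flip as the test-time augmentation, and averaging (rather than summing, which yields the same predicted class) are assumptions made for illustration.

```python
import numpy as np

def predict_with_tta(models, image):
    """Average class probabilities over fold models and augmented views."""
    views = [image, image[:, ::-1]]  # original and left-right mirrored view
    probs = [model(view) for model in models for view in views]
    return np.mean(probs, axis=0)
```

The predicted grade is then the index of the largest averaged probability, e.g., `int(np.argmax(predict_with_tta(models, image)))`.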
The suggested model uses a transfer learning (TL) approach based on the pretrained Inception-ResNet model, which has 50 weighted layers. This pretrained Inception-ResNet model has four phases; every phase is made up of three convolutional layers and is repeated n times (shown in Figure 6). Inception-ResNet is a CNN structure that builds on the architecture of the Inception family but includes residual connections (replacing the filter concatenation stage of the Inception architecture). This property helps the Inception-ResNet technique learn global, data-specific features. The comprehensive design of the proposed model is presented in Figure 8.
In order to implement the transfer learning (TL) method, the parameters of the convolutional layers are transferred without any modification, and the fully connected layers are replaced with a specially designed classifier that has four class labels, representing the four diabetic retinopathy (DR) stages. This specially designed classifier is trained with the labels obtained from the IDRiD dataset.

Classification in First Phase
In the first phase, we examined two kinds of specially designed classifiers for classifying DR images. The first classifier is a feedforward neural network (NN) classifier, made up of a fully connected (FC) weighted layer between the feature vector of the Inception-ResNet architecture and an output layer with one node per DR stage. The second classifier applies binary kernel SVM classification to the extracted features. To train the first-phase classifiers, four labels are used: normal, mild DR, moderate DR, and severe/proliferative DR. Severe and proliferative DR are merged into a single label, "severe/proliferative DR (PDR)", throughout experimentation because of the similarity of their fundus images. To choose the best classifier, the suggested model is evaluated with a standard evaluation metric, i.e., the classification accuracy.
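A fully connected layer between a frozen feature vector and four output nodes can be sketched as a softmax classifier trained by gradient descent. This is a minimal stand-in, not the authors' implementation: the features are assumed to be precomputed by the frozen backbone, and the names `train_fc_head` and `predict`, the learning rate, and the epoch count are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_fc_head(features, labels, num_classes=4, lr=0.5, epochs=200, seed=0):
    """Train one FC layer on fixed features with the cross-entropy loss."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        probs = softmax(features @ W + b)
        grad = (probs - onehot) / n  # gradient of the mean cross-entropy w.r.t. logits
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    return softmax(features @ W + b).argmax(axis=1)
```

Only `W` and `b` are updated here, which mirrors the transfer learning setup in which the convolutional parameters stay unchanged.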

Classification in Second Phase
Because of the similarity of the last two stages (severe DR and proliferative DR (PDR)), the original Inception-ResNet is not capable of differentiating between these two classes. To achieve the best accuracy, a second-phase Inception-ResNet classification is added after the first-phase classification. This phase is trained offline to differentiate between the severe and PDR classes, and the fundus images that the first phase assigns to the merged severe/PDR class are fed into it. The output of this phase is either "severe DR" or "proliferative DR".
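The two-phase routing can be sketched as below. The function `grade_image` and the stand-in classifiers are hypothetical placeholders; they only illustrate that the second phase runs when, and only when, the first phase returns the merged severe/PDR label.

```python
MERGED_LABEL = "severe/PDR"


def grade_image(image, first_phase, second_phase):
    """Return the final DR stage for one image.

    first_phase(image) returns one of: "normal", "mild DR",
    "moderate DR", or "severe/PDR".
    second_phase(image) returns "severe DR" or "proliferative DR".
    """
    label = first_phase(image)
    if label == MERGED_LABEL:
        # Only images in the merged class reach the second network.
        return second_phase(image)
    return label


# Stand-in classifiers for illustration only (not trained models).
def toy_first_phase(image):
    return image["first"]


def toy_second_phase(image):
    return image["second"]


example_mild = grade_image(
    {"first": "mild DR", "second": "severe DR"},
    toy_first_phase,
    toy_second_phase,
)
example_pdr = grade_image(
    {"first": "severe/PDR", "second": "proliferative DR"},
    toy_first_phase,
    toy_second_phase,
)
```

Because the second network is skipped for most images, its cost is paid only for the merged class.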

Performance Evaluation Metrics
To assess the performance of the suggested model, the classification accuracy (CA) is used as the evaluation metric. CA_g denotes the fraction of correctly categorized fundus photographs of a particular grade g, defined as:

$$\mathrm{CA}_g = \frac{\text{number of correctly classified grade-}g\text{ images}}{\text{total number of grade-}g\text{ images}} \tag{1}$$

where g represents the grade of diabetic retinopathy (DR), ranging from normal to severe DR. For example, if 80 of a total of 100 normal photographs are categorized properly, then CA_normal = 80%. The CA of the whole classification system is obtained as the average of the individual CAs over all stages.
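As a worked illustration of Equation (1), the following sketch computes the per-grade CA and its average over grades. The function names are hypothetical, and the toy labels merely reproduce the 80-out-of-100 example from the text.

```python
from collections import Counter


def per_grade_accuracy(y_true, y_pred):
    """Return {grade: CA_g} as fractions in [0, 1]."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {grade: correct[grade] / totals[grade] for grade in totals}


def overall_accuracy(y_true, y_pred):
    """Average of the per-grade CAs, as described in the text."""
    ca = per_grade_accuracy(y_true, y_pred)
    return sum(ca.values()) / len(ca)


# Toy data: 80 of 100 normal images are classified correctly.
toy_true = ["normal"] * 100
toy_pred = ["normal"] * 80 + ["mild"] * 20
toy_ca = per_grade_accuracy(toy_true, toy_pred)
```

Note that averaging per-grade CAs weights every grade equally, regardless of how many images each grade contains.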

Experimental Work
The comprehensive setup of the suggested model, associated outcomes, and results are discussed below.

Dataset and Experimental Setup
To train the Inception-ResNet CNN architecture, the data are randomly divided into a 70% training set and a 30% testing set. Training minimizes the cross-entropy loss with a learning rate (LR) of 4-10 for at most 20 epochs.
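A random 70/30 split of this kind can be sketched as follows. The helper `random_split` and the fixed seed are assumptions added for reproducibility of the illustration; the paper does not state how the random split was implemented.

```python
import random


def random_split(items, train_fraction=0.7, seed=0):
    """Shuffle items and split them into (train, test) lists."""
    rng = random.Random(seed)  # local generator, leaves global state alone
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Toy example with ten image identifiers.
toy_train, toy_test = random_split(range(10))
```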

First Phase Results
To examine the usefulness of every preprocessing step in the suggested network (first normalization of the data, then augmentation, and lastly balancing), the results of the network for the initial classification phase are compared across six categories:
• Category A: No preprocessing step; only the unprocessed images are used.
• Category B: Only normalization is used.
• Category C: Normalization as well as augmentation.
• Category D: Normalization as well as balancing of data.
• Category E: Normalization of the dataset, augmentation, and balancing of data.
• Category F: Full normalization, augmentation, and balancing.
The last two categories include a three-step process.
For every category, outcomes are compared between the pixel-wise neural network classifier and the support vector machine (SVM) classifier. As shown in Figure 9 and Tables 1 and 2, the pixel-wise neural network classifier delivers better accuracy than the SVM classifier for each examined category (i.e., categories A to F). The output of the system progressively improves as more preprocessing stages are used: the suggested intensity normalization step increases the overall classification accuracy from 61.35% to 63.98%. Adding the balancing or the augmentation step increases the performance to CA = 68.86% or CA = 69.34%, respectively. Lastly, including all the suggested stages increases the accuracy of the suggested system to CA = 84.62% or CA = 88.10%. These outcomes demonstrate the effect of the suggested preprocessing stages on performance.

Second Phase Results
A second-phase Inception-ResNet model is included to differentiate between the severe and PDR classes, and the fundus photographs that the initial phase assigns to the merged severe/PDR class are fed into it. To show the benefit of the suggested two-step model for diabetic retinopathy classification, it is compared with other relevant studies on the same IDRiD dataset. As illustrated in Table 2 and Figure 9, the performance of the system steadily improves as more preprocessing steps are used. For example, the suggested intensity normalization step increases the classification accuracy from 61.35% to 63.98%. Adding a balancing step or an augmentation step improves the classification accuracy from 63.98% to 68.86% or 69.34%, respectively. Lastly, including all the proposed steps increases the accuracy of the proposed model from 69.34% to 88.10%; these outcomes show the effect of the proposed preprocessing steps on performance. As shown in Table 2, the proposed two-step Inception-ResNet model performs best when all the suggested preprocessing stages are implemented.

Analysis of the Performance and Complexity
The benefit of the suggested two-step Inception-ResNet model is its capability to differentiate among all the categories with high performance, which is supported by the CA values. Furthermore, because the proposed model is trained offline, the overhead of including the second Inception-ResNet phase is small, considering that this phase is used only when the result of the initial phase is the class "severe or PDR". We compare the results of our model with the current state-of-the-art work [10,29-32] because that approach follows a technique that resembles our proposed model. Just as we try different combinations of preprocessing steps and obtain better results with each added step, that approach also implements preprocessing steps, and it brings together eight CNN models: five ResNets of different sizes (ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152) and three Inception Nets. Since that method uses both ResNet and Inception Net, it is a sensible point of comparison: our model also combines Inception and ResNet components, but with fewer layers than ResNet152, the largest network of that technique, which suggests that our proposed Inception-ResNet model achieves high performance at a lower computational cost of training. Furthermore, the testing time is short. On a 3.4 GHz Intel Core i3 with 8 GB of RAM, and with Python as the programming language, the suggested two-step Inception-ResNet model processes a test fundus image and delivers its stage in 1.32 s on average, which includes 0.21 s for normalization of data, 0.41 s to calculate the first-phase DR stage, and 0.41 s to give the result of the second-phase DR stage. Table 3 shows the comparison of results between associated techniques.
Because the final output of our problem is a multiclass classification, cross-entropy is the most suitable loss function. Cohen's kappa is widely used for evaluation purposes; this evaluation metric measures the agreement between predicted and actual labels. As this metric is non-differentiable, we cannot use kappa as a loss function, but we can use it to measure the performance of the suggested system. We use the Adam optimizer with an initial learning rate (LR) of 0.002. Throughout training, we use an LR scheduler that multiplies the LR by 0.5 after each group of five epochs. This makes smaller changes to the weights of the network as we approach the optimum.
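The LR schedule and the kappa metric described above can be sketched as follows. Two assumptions are made for illustration: epochs are counted from zero, and kappa is the plain unweighted Cohen's kappa (the text does not say whether a weighted variant is used). The function names are hypothetical.

```python
from collections import Counter


def step_decay_lr(epoch, base_lr=0.002, factor=0.5, step=5):
    """LR for a 0-based epoch: halved after every group of five epochs."""
    return base_lr * factor ** (epoch // step)


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement: fraction of matching labels.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement by chance, from the two label distributions.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    p_e = sum(true_counts[l] * pred_counts[l] for l in labels) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one label occurs")
    return (p_o - p_e) / (1.0 - p_e)


# LR used in each of the first 15 epochs under this schedule.
toy_schedule = [step_decay_lr(epoch) for epoch in range(15)]
```

Kappa involves counting label matches, which has no useful gradient with respect to the network weights; this is why it serves as a monitoring metric while cross-entropy drives the optimization.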
After every epoch during training, we validate the proposed model on the selected images. We extract the class scores from the last fully connected (FC) layer and predict the image class that corresponds to the maximum score. Training is planned for 15 epochs while monitoring evaluation metrics such as Cohen's kappa and the validation loss. Training is stopped if the kappa is unchanged for five successive epochs, and the weights of the model are saved for the epoch with the highest validation kappa. The loss and training curves are shown in Figure 10.
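The stopping rule can be sketched as below. It is a simplified illustration that interprets "unchanged" as "no improvement over the best kappa so far", which is an assumption; the function name and the toy kappa values are hypothetical.

```python
def run_early_stopping(val_kappas, patience=5):
    """Replay a sequence of validation kappas, one per epoch.

    Returns (best_epoch, last_epoch): the 0-based epoch whose weights
    would be kept, and the last epoch that was actually run.
    """
    best_kappa = float("-inf")
    best_epoch = None
    last_epoch = None
    epochs_without_gain = 0
    for epoch, kappa in enumerate(val_kappas):
        last_epoch = epoch
        if kappa > best_kappa:
            # New best: this is where the weights would be saved.
            best_kappa, best_epoch = kappa, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break  # five successive epochs without improvement
    return best_epoch, last_epoch


# Toy kappa curve: the best value appears at epoch 2, then stalls.
toy_kappas = [0.5, 0.6, 0.7, 0.7, 0.69, 0.7, 0.68, 0.7, 0.9]
toy_best, toy_last = run_early_stopping(toy_kappas)
```

In the toy curve, training halts before the final value is ever reached, which shows the trade-off of a fixed patience window.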
We also build a confusion matrix (CM) of the trained model, shown in Figure 11. The values in the cells are percentages (%). The preliminary results show that the model does not distinguish well between the moderate and mild stages of diabetic retinopathy (DR): 85% of the data in the mild category are assigned to the moderate category. Only the data of the normal category show an encouraging performance. As a whole, we conclude that the model tends to confuse adjacent severity classes but rarely confuses the proliferative DR and mild categories with each other.
The process of fine-tuning is performed on the targeted data with four-fold validation. To make sure that every category or stage has sufficient samples, we perform the cross-validation with stratification. After fine-tuning, the confusion matrix shows the benefit of the tuned network over the pretrained convolutional neural network, with improved performance in categorizing the mild class of diabetic retinopathy (Figure 12). Yet, alongside the loss and training curves shown in Figure 13, we also notice that the fine-tuned model assigns numerous examples to class 2 (moderate).
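A confusion matrix expressed in row percentages can be computed as in the sketch below. The function name is hypothetical, and the toy labels are invented solely to produce a row in which 85% of mild images land in the moderate column; they are not the paper's data.

```python
def confusion_matrix_percent(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes.

    Each row is normalized to percentages of that true class.
    """
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # An empty row (class absent from y_true) stays at zero.
        matrix.append([100.0 * c / total if total else 0.0 for c in row])
    return matrix


# Toy labels: 17 of 20 mild images are predicted as moderate.
toy_labels = ["mild", "moderate"]
toy_true = ["mild"] * 20 + ["moderate"] * 10
toy_pred = ["moderate"] * 17 + ["mild"] * 3 + ["moderate"] * 10
toy_cm = confusion_matrix_percent(toy_true, toy_pred, toy_labels)
```

Row normalization makes classes of different sizes comparable, which matters when some DR stages have few samples.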
After training the model, we make predictions on the test data. We combine the predictions of the models trained in the cross-validation loop. For this purpose, we extract the class scores from the last FC layer and predict, for each image, the class with the highest score. Then, we average the predictions of the four models trained on different arrangements of the training folds. Tables 4-6 below show the various arrangements of predictions.
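One way to combine the fold models is sketched below. It assumes that the class scores are averaged across models before taking the highest-scoring class; the text could also be read as averaging the predicted labels, so this ordering is an assumption. The function names and toy scores are hypothetical.

```python
def average_scores(per_model_scores):
    """per_model_scores[m][i][c]: score of class c for image i from model m.

    Returns one averaged score list per image.
    """
    n_models = len(per_model_scores)
    n_images = len(per_model_scores[0])
    averaged = []
    for i in range(n_images):
        # Group the scores of image i by class across all models.
        by_class = zip(*(scores[i] for scores in per_model_scores))
        averaged.append([sum(values) / n_models for values in by_class])
    return averaged


def predict_classes(score_rows):
    """Index of the highest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in score_rows]


# Toy example: two models, two images, two classes.
toy_scores = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.4, 0.6], [0.3, 0.7]],
]
toy_predictions = predict_classes(average_scores(toy_scores))
```

In the toy example the second model alone would assign the first image to class 1, but the averaged scores favor class 0.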

Conclusions and Future Work
This study presents a three-phase framework to automate the detection and grading of DR. The suggested method is composed of preprocessing, feature extraction, and, lastly, classification. Experimental outcomes show that augmentation and balancing of data considerably enhance the performance of the system. Comparing the outcomes with other state-of-the-art work on the IDRiD dataset supports the greater accuracy of the suggested system for grading of DR. There are several ways to improve this method in the future. First, a bigger architecture and a larger number of training epochs are likely to improve the outcomes. At the same time, this would need additional processing power and resources, which may be impractical when an automated system is used in real practice. Secondly, the preprocessing methods for the target image data could be further enhanced. Lastly, the state-of-the-art results of other researchers depend on ensembling CNN architectures of various sizes and designs. Combining several dissimilar networks and joining together their predicted results could also enhance the suggested solution. Our plan is to examine the fundus images with other deep learning techniques to enhance the performance, and we will also use other databases and datasets to verify the robustness of the suggested system.
Author Contributions: S.Y. and N.I. proposed the research conceptualization and methodology. The technical and theoretical framework was prepared by T.A. and U.D. The data preprocessing was performed by M.I. and A.A. The technical review and improvement were performed by A.R., S.A., and F.B. The overall technical support, guidance, and project administration were performed by A.G. and F.B. The editing and, finally, proofreading were performed by K.P., F.B. and L.W. All authors have read and agreed to the published version of the manuscript.