Diabetic Retinopathy Improved Detection Using Deep Learning

Diabetes is a disease that occurs when the body presents an uncontrolled level of glucose, which can damage the retina and lead to permanent eye damage or vision loss. When diabetes affects the eyes, the condition is known as diabetic retinopathy, which has become a global medical problem among elderly people. The fundus oculi technique involves observing the back of the eyeball to diagnose a pathology or monitor its evolution. In this work, we implement a convolutional neural network model that processes a fundus oculi image to recognize the eyeball structure and determine the presence of diabetic retinopathy. The model's parameters are optimized using transfer learning to map an image to its corresponding label. The model is trained and tested on a dataset of medical fundus oculi images labeled with a severity scale for the pathology present in the eyeball. The severity scale separates the images into five classes, from a healthy eyeball to the presence of proliferative diabetic retinopathy, the latter likely corresponding to a blind patient. Our proposal achieved an accuracy of 97.78%, allowing for the confident prediction of diabetic retinopathy in fundus oculi images.


Introduction
Diabetic retinopathy (DR) is a disease that affects the blood vessels of the retina, which are damaged by multiple alterations caused by a set of metabolic disorders [1]. The capillaries are damaged by the loss of pericytes, contractile cells that wrap the endothelial cells of capillaries and venules. This damage is caused by excess glucose molecules in the blood, which clump together in the vessels and disrupt circulation, a process known as ischemia. The deterioration of these blood vessels produces microaneurysms, which are saccular enlargements of the venous end of a retinal capillary caused by the lack of blood circulation. This process makes the vessels lose their impermeability, resulting in leaks such as hemorrhages or lipid exudation [2].
Ischemia gives rise to two major problems in the retina [3]. The first is the production of the cytokine Vascular Endothelial Growth Factor (VEGF), which generates new blood vessels, known as neovessels, from existing ones. The problem is that these neovessels grow on the surface of the retina and of the vitreous humor. Since there is no blood circulation in these areas, the neovessels keep growing until they burst, producing bleeding into the vitreous cavity or pulling the retina apart. This tissue expansion ultimately results in the patient's blindness. The second problem is plasma leakage: lipid exudation causes fat to accumulate, altering the macula and leading to vision loss.
Once the condition of the retinal fundus has been examined, diabetic retinopathy can be classified into stages from the earliest to the most advanced. The disease has two main categories: Non-Proliferative Diabetic Retinopathy (NPDR) and Proliferative Diabetic Retinopathy (PDR) [4]. NPDR is further divided into three subcategories: slight, medium, and severe.
The damage in this first category is limited and does not go beyond the retina's inner limiting membrane. DR is proliferative when ischemic damage causes blood vessels to grow beyond the retina. PDR comprises early, high-risk, and advanced sublevels. DR typically develops when a patient has had diabetes for at least 10 years without being diagnosed or aware of it. Therefore, DR can be prevented if it is detected early enough through health check-ups and systematic treatment of diabetes [5].
Digital retinal fundus images have been used with computer vision techniques [6,7] to automatically detect DR at different stages and, more recently, with deep learning image processing methods [8]. Deep learning (DL) is a proven methodology that automatically extracts features from images processed by a stack of convolutional neural network layers [9]. These features capture what is present in the image, which makes them useful for classification. DL models have shown a higher capability than humans in recognizing objects [10]. However, this methodology still requires considerable data and computational resources to optimize the model's parameters.
In this work, we propose a DL model to classify retinal fundus images and detect the presence of DR at its different stages. The model was optimized using transfer learning from DenseNet121 [11] to differentiate the stages from a healthy eyeball to one with proliferative DR. Our proposal was tested on two datasets using a cross-testing method to assess how difficult it is to acquire features from each. The model achieved its best validation and testing results when trained on the APTOS dataset, with accuracies of 81% and 59%, respectively, in predicting the presence of diabetic retinopathy in fundus oculi images.

Related Works
The earliest digital retinal fundus images were classified using hand-engineered features with empirically obtained parameters. One of these works was presented by Cree et al. [12], who demonstrated that computer vision techniques were suitable for automatically detecting microaneurysms. Their experiments relied on simple morphological and thresholding techniques using eight features, including pixel area and total pixel intensity, measured on each candidate. The proposed method achieved results comparable to those obtained by clinicians, showing that automated microaneurysm detection can be used for diagnostic purposes.
In more recent works, methods evolved to detect not only microaneurysms in fundus images but also the stage of diabetic retinopathy. Yun et al. [7] proposed a method to classify retinal fundus images as normal, moderate, severe, or proliferative DR. The input images were preprocessed using morphological operations with disc and diamond structuring elements. Then, six features were obtained from the perimeter and area of the pixels in the RGB channels. For classification, a feedforward neural network with a single hidden layer of 8 units was used, with 6 input units, one for each feature value, and 4 output units, one for each DR severity. Nayak et al. [6] proposed a method that preprocesses the image with adaptive histogram equalization before applying morphological operations and texture analysis to extract blood vessel features. A multilayer perceptron processed the extracted features. The architecture had 4 inputs, corresponding to the blood vessel area and perimeter, the exudate area, and texture. The inputs were processed by 2 hidden layers of 8 units each, followed by an output layer of 2 units that encodes the normal retina as '01', nonproliferative DR as '10', and proliferative DR as '11'. Rosas et al. [13] proposed microaneurysm recognition using computer vision techniques to preprocess the image. First, nonuniform illumination was reduced and grayscale intensities were normalized to obtain two features. The first feature used principal component analysis to discriminate round-shaped candidate regions, and the second used Radon transform features computed over a discrete set of angle values. These features were then passed to a hierarchical system of classifiers composed of two perceptron units that learn the threshold required to determine whether a region contains a microaneurysm.
Hand-crafted features require specialized knowledge and empirical tuning to achieve accurate microaneurysm detection in digital images. Recent advances in image processing have automated the extraction of useful information from raw images using convolutional neural networks [14,15]. Many classification tasks use deep learning methods with a large stack of convolutional layers to acquire features from the network's input. An early approach was presented by Gargeya et al. [16], who automated DR screening and obtained a validation AUC of 0.95 on the Messidor dataset using 5-fold cross-validation. Dutta et al. [17] presented a multilayer perceptron that outperformed a convolutional neural network, reaching 83.6% using extracted statistical features such as the average, median, standard deviation, maximum, and minimum. Mansour [18] proposed a computer-aided DR diagnosis system that applies background subtraction to the image before extracting features with the AlexNet model. These features are then classified by a support vector machine (SVM) using 10-fold cross-validation. The results showed that reducing the features before SVM processing improves the results, reaching 97.93% validation accuracy. Qummar et al. [19] proposed an ensemble of 5 deep learning models that performed well with unbalanced data, achieving 70% validation accuracy. More recently, Gadekallu et al. [8] presented a deep learning model optimized by intelligent computing and principal component analysis (PCA), achieving an accuracy of 96%. Majumder et al. [20] proposed a real-time algorithm for smartphones with an accuracy of 87.4%, focusing on a lightweight and efficient model.
The approaches mentioned above reveal how difficult it is to distinguish among the 4 DR stages. This challenge is evident in that some works evaluate their proposals as a binary classification between a healthy and an unhealthy eye. Additionally, there is a lack of test datasets prepared by qualified personnel to verify a model's specificity, as proposed in [21].

Deep Learning
Deep learning (DL) [9] enables computational models composed of multiple processing layers to learn data representations at multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, pattern recognition, object detection, and many other domains [22,23]. DL uncovers intricate structure in large datasets by using the backpropagation algorithm to indicate how a model should update the internal parameters it uses to compute each layer's representation from the previous layer's representation. One challenge of DL models is the huge amount of data required to optimize them. Several approaches address this issue through transfer learning [24,25], avoiding the need for huge datasets. Transfer learning [26] starts from previously optimized model parameters and adapts them to a new data distribution domain. This prior knowledge reduces the training time required to update the model's parameters for a related task on a new dataset. Models trained on datasets with highly complex features, such as ImageNet [27], are commonly used for transfer learning because their optimized kernels are highly varied. This methodology also improves the model's accuracy in many other kinds of tasks.
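To make the transfer-learning idea concrete, the following minimal NumPy sketch keeps a hypothetical "pretrained" feature extractor frozen (a fixed random projection with ReLU standing in for convolutional layers) and trains only a new softmax head, using gradients obtained by backpropagating the categorical cross-entropy. The names `extract_features` and `train_head`, the dimensions, the learning rate, and the synthetic data are assumptions for illustration only. In full fine-tuning, as in the training setup of this work where all parameters are updated, the backbone weights would also receive gradients.

```python
import numpy as np

# Hypothetical stand-in for a pretrained backbone: a fixed random projection
# followed by ReLU. These weights stay frozen; only the new head is trained.
_FROZEN_W = np.random.default_rng(0).normal(size=(16, 8))


def extract_features(x):
    """Frozen feature extractor (never updated during training)."""
    return np.maximum(x @ _FROZEN_W, 0.0)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_head(x, y_onehot, lr=0.01, epochs=300):
    """Fit a softmax head on frozen features with gradient descent.

    Backpropagation stops at the head: the gradient of the mean categorical
    cross-entropy w.r.t. the logits is (p - y) / n, and the chain rule turns
    it into gradients for the head's weights and bias.
    """
    feats = extract_features(x)
    n, d = feats.shape
    k = y_onehot.shape[1]
    w = np.zeros((d, k))
    b = np.zeros(k)
    losses = []
    for _ in range(epochs):
        p = softmax(feats @ w + b)
        losses.append(-np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1)))
        grad_logits = (p - y_onehot) / n
        w -= lr * feats.T @ grad_logits
        b -= lr * grad_logits.sum(axis=0)
    return w, b, losses


# Usage on synthetic two-class data (labels depend on the sign of x[:, 0]).
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 16))
y = np.eye(2)[(x[:, 0] > 0).astype(int)]
w, b, losses = train_head(x, y)
# losses[0] is about log(2) = 0.693 because the head starts at zero.
print(losses[0], losses[-1])
```

Because only the head's parameters change, the cost per step is independent of the (frozen) backbone's size once features are computed, which is the practical appeal of transfer learning on small datasets.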

Diabetic Retinopathy
Diabetic retinopathy can be classified into different classes according to its stage, from the earliest to the most advanced. This classification depends on the condition of the retina in the fundus oculi evaluation and has two main categories: Nonproliferative Diabetic Retinopathy (NPDR) and Proliferative Diabetic Retinopathy (PDR) [15].
NPDR is subdivided into slight, medium, and severe; the changes it produces are limited to the retina and do not go beyond the inner limiting membrane. PDR can be subdivided into early, high-risk, and advanced, with ophthalmoscopic alterations produced by ischemia that result in the formation of new blood vessels, which proliferate beyond the retina. In this work, PDR is used as a single class containing the early, high-risk, and advanced subtypes.

• No DR: patient without retinal alterations due to diabetes.
• Severe NPDR: two of the three criteria of medium NPDR are present in the fundus oculi.
• Proliferative DR: proliferative retinal neovessels appear in the retina's image.

Fundus Oculi Proposed Classification
We propose a methodology capable of processing retinal fundus images for the early detection of diabetic retinopathy and, if present, its degree. For this purpose, a deep learning model with transfer learning analyzes previously prepared images and infers how likely the presence of diabetic retinopathy is across 5 classes: No DR, Slight NPDR, Medium NPDR, Severe NPDR, and Proliferative DR.
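The five-class label space can be encoded as follows. This is a minimal sketch that assumes the integer codes follow the 0 to 4 grading used by the APTOS 2019 labels (0 = No DR, 1 = mild, 2 = moderate, 3 = severe, 4 = proliferative DR), with "slight" and "medium" mapped to mild and moderate; the names `DRGrade`, `one_hot`, and `decode` are hypothetical. `one_hot` builds the target vector used by categorical cross-entropy, and `decode` turns a softmax output into the most likely class.

```python
from enum import IntEnum


class DRGrade(IntEnum):
    """Five-class DR severity scale (codes assumed to follow APTOS 0-4)."""
    NO_DR = 0
    SLIGHT_NPDR = 1
    MEDIUM_NPDR = 2
    SEVERE_NPDR = 3
    PROLIFERATIVE_DR = 4  # early, high-risk and advanced PDR merged


def one_hot(grade: DRGrade) -> list[float]:
    """Encode a grade as a one-hot target for categorical cross-entropy."""
    vec = [0.0] * len(DRGrade)
    vec[int(grade)] = 1.0
    return vec


def decode(probs: list[float]) -> tuple[DRGrade, float]:
    """Return the most likely grade and its softmax probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return DRGrade(best), probs[best]


print(one_hot(DRGrade.MEDIUM_NPDR))  # [0.0, 0.0, 1.0, 0.0, 0.0]
print(decode([0.1, 0.1, 0.1, 0.6, 0.1]))
```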

CNN Architecture
The proposed model is based on the DenseNet121 architecture [11], which receives a 224 × 224 pixel RGB image. Its weights were pretrained on the ImageNet dataset and are used to obtain sufficient features to optimize the weights of the fully connected output layers. The convolutional layers form the feature extraction stage, and the final fully connected layers form the classification stage. A distinctive feature of DenseNet is that the output of each convolutional layer is concatenated to the inputs of all subsequent layers in the same block, as shown in Figure 1. Once the convolutional layers process the image, the resulting features are passed to the classification stage. The classifier has two fully connected hidden layers: the first has 1024 units with ReLU activation and is followed by a dropout layer with a rate of 50%, and the second has 1024 units with ReLU activation. The output layer has 5 units with a softmax activation that assigns a probability to each class.
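The dense connectivity and the classifier head described above can be sketched in NumPy. The first function only does channel bookkeeping: with the DenseNet121 configuration reported in [11] (64 stem channels, growth rate 32, dense blocks of 6, 12, 24, and 16 layers, transition layers with compression 0.5), concatenation yields a 1024-channel final feature map. The other two functions implement a forward pass of the head (1024-unit ReLU layer, 50% dropout, 1024-unit ReLU layer, 5-unit softmax) on a 1024-dimensional feature vector. Global average pooling of the backbone output, He initialization, and all function names are our assumptions; the pretrained convolutional backbone itself is not reproduced.

```python
import numpy as np


def densenet_channels(init=64, growth=32, blocks=(6, 12, 24, 16), compression=0.5):
    """Channel count after each dense block of DenseNet121.

    Inside a block every layer concatenates `growth` new feature maps onto
    all earlier ones, so the width grows linearly; each transition layer
    between blocks multiplies it by `compression`.
    """
    channels = init
    after_block = []
    for i, n_layers in enumerate(blocks):
        channels += n_layers * growth                 # dense concatenation
        after_block.append(channels)
        if i < len(blocks) - 1:
            channels = int(channels * compression)    # transition layer
    return after_block


def init_head(rng, in_dim=1024, hidden=1024, n_classes=5):
    """Weights for Dense(1024)-ReLU, Dense(1024)-ReLU, Dense(5)-softmax.

    He-normal initialisation is an assumption; the text does not specify it.
    """
    dims = [in_dim, hidden, hidden, n_classes]
    return [(rng.normal(0.0, np.sqrt(2.0 / a), size=(a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]


def head_forward(params, feats, rng=None, drop_rate=0.5):
    """Classifier forward pass; pass `rng` to apply dropout (training mode)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = np.maximum(feats @ w1 + b1, 0.0)       # Dense(1024) + ReLU
    if rng is not None:                         # inverted dropout, p = 0.5
        keep = rng.random(h.shape) >= drop_rate
        h = h * keep / (1.0 - drop_rate)
    h = np.maximum(h @ w2 + b2, 0.0)           # Dense(1024) + ReLU
    logits = h @ w3 + b3                        # Dense(5)
    logits -= logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)     # softmax over the 5 classes


rng = np.random.default_rng(0)
print(densenet_channels())  # [256, 512, 1024, 1024]
params = init_head(rng)
# Stand-in for globally pooled DenseNet121 features of two images.
feats = rng.random((2, densenet_channels()[-1]))
print(head_forward(params, feats).shape)  # (2, 5)
```

In a Keras pipeline, the backbone is commonly obtained with `tf.keras.applications.DenseNet121(include_top=False, weights="imagenet", pooling="avg")`, whose pooled output is this 1024-dimensional vector; the head would then be stacked on top of it.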

Retina Images Dataset
For this work, two datasets were used in a cross-testing approach to check the model's ability to acquire complex features for the different classes described in Section 2.2 and to assess the advantages of each dataset. The first dataset is APTOS, publicly available on the Kaggle platform (https://www.kaggle.com/c/aptos2019-blindness-detection/data, accessed on 12 November 2020), containing 3662 labeled images for training and 1928 unlabeled images for testing. The second dataset, Messidor, is available by registering on the site that manages it (https://www.adcis.net/en/third-party/messidor2/, accessed on 12 February 2021) and contains 1744 unlabeled images for training.
The Messidor dataset [28] comprises 872 DR examinations, each consisting of two macula-centered fundus images, one of each eye. The ground truth for these images came from a third-party source. All images share the same fundus oculi format: the eye appears in the center of a wide image on a black background. The APTOS dataset was published as a competition on the same platform, with fundus oculi images taken under different conditions and at different sizes. Because its authors provide ground truth only for the training subset, we do not use the testing subset. Both datasets contain high-resolution images and are unbalanced across the 5 classes; Table 1 details the class distribution. As preprocessing, all images in both datasets were cropped to a square from the center, keeping the most important part of the image, the retinal fundus. They were then resized to 224 × 224 pixels to fit the input of the pretrained DenseNet model and preprocessed following the method proposed in the original work [11].
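The crop-and-resize step can be sketched in NumPy as below. This is a minimal version under stated assumptions: images are H × W × 3 `uint8` arrays, nearest-neighbor resizing is used only to keep the example dependency-free (a real pipeline would more likely use bilinear or area interpolation from an imaging library), and the final step applies the ImageNet per-channel mean/standard-deviation normalization commonly paired with pretrained DenseNet weights, which is our assumption about what the original preprocessing amounts to. The function names are hypothetical.

```python
import numpy as np

# ImageNet channel statistics (assumed normalization for DenseNet weights).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])


def center_square_crop(img):
    """Keep the central square of an H x W x C image (side = min(H, W))."""
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    return img[top:top + s, left:left + s]


def resize_nearest(img, size=224):
    """Nearest-neighbor resize of a square image to size x size."""
    s = img.shape[0]
    idx = np.arange(size) * s // size   # source row/column for each output pixel
    return img[idx][:, idx]


def preprocess(img_uint8, size=224):
    """Center crop, resize, scale to [0, 1], then normalize per channel."""
    square = center_square_crop(img_uint8)
    small = resize_nearest(square, size).astype(np.float64) / 255.0
    return (small - IMAGENET_MEAN) / IMAGENET_STD


# A wide fundus photograph with black side margins, as in Messidor.
photo = np.zeros((1536, 2048, 3), dtype=np.uint8)
print(preprocess(photo).shape)  # (224, 224, 3)
```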

Experimental Results
Our model was trained using the Adam optimizer with a learning rate of 1 × 10⁻⁵ and the categorical cross-entropy loss function. All model parameters were updated for up to 50 epochs, using early stopping to prevent overfitting. Both datasets were used for training, validation, and testing in a cross-testing scheme: if the APTOS dataset is used to optimize the model, 80% of it is used for training and 20% for validation, leaving the Messidor dataset for testing. Conversely, if Messidor is used to optimize the model, the same proportions are used for training and validation, leaving the APTOS dataset for testing.
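The cross-testing protocol and the early-stopping rule can be sketched in plain Python. This is a minimal illustration under assumptions not stated in the text: a seeded shuffle before the 80/20 split, truncation when computing the 80% cut, and a patience of 5 epochs with no minimum improvement threshold. `cross_testing`, `EarlyStopping`, and `run_epochs` are hypothetical names; the sketch operates on lists of sample identifiers and a precomputed sequence of validation losses rather than on a real model.

```python
import random


def split_train_val(items, train_frac=0.8, seed=0):
    """Shuffle one dataset and split it into training and validation subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]


def cross_testing(source, target, train_frac=0.8, seed=0):
    """Optimize on `source` (80/20 train/val) and test on all of `target`."""
    train, val = split_train_val(source, train_frac, seed)
    return train, val, list(target)


class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience


def run_epochs(val_losses, max_epochs=50, patience=5):
    """Number of epochs actually run for a given validation-loss sequence."""
    stopper = EarlyStopping(patience)
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(loss):
            return epoch
    return min(len(val_losses), max_epochs)


# APTOS (3662 labeled images) for optimization, Messidor (1744) for testing.
train, val, test = cross_testing(range(3662), range(1744))
print(len(train), len(val), len(test))  # 2929 733 1744
```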
To compare the results obtained with each training dataset, we use accuracy, precision, recall, F1-score, the receiver operating characteristic (ROC) curve, and the area under the ROC curve. Tables 2 and 3 present the metrics for validation and testing, respectively. In both tables, DS means dataset; in Table 3 only, APTOSMess and MessAPTOS denote the testing metrics on Messidor for the model trained on APTOS, and vice versa, respectively. The model achieves higher validation accuracy when trained on APTOS than on Messidor, with 81% and 64%, respectively. In cross-testing, the model trained on APTOS reached 59% accuracy recognizing the classes in the Messidor dataset, outperforming the model trained on Messidor, which reached 33% testing accuracy on APTOS. In validation, the model achieves better precision, recall, and F1-score on APTOS than on Messidor for most classes. However, the Severe NPDR class obtained better results with Messidor, and the model did not learn to classify the PDR label from the Messidor dataset.
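The per-class metrics in Tables 2 and 3 follow the standard one-vs-rest definitions. The plain-Python sketch below, with the hypothetical name `classification_report`, computes accuracy and per-class precision, recall, and F1-score from integer labels (0 to 4). Returning 0.0 when a denominator is zero is an assumed convention; it is what happens, for instance, for a class that the model never predicts, such as PDR for the model trained on Messidor.

```python
from collections import Counter


def classification_report(y_true, y_pred, n_classes=5):
    """Accuracy plus per-class (precision, recall, F1), one-vs-rest."""
    pairs = Counter(zip(y_true, y_pred))  # confusion-matrix counts
    accuracy = sum(pairs[(c, c)] for c in range(n_classes)) / len(y_true)
    per_class = {}
    for c in range(n_classes):
        tp = pairs[(c, c)]
        fp = sum(pairs[(t, c)] for t in range(n_classes) if t != c)
        fn = sum(pairs[(c, p)] for p in range(n_classes) if p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[c] = (precision, recall, f1)
    return accuracy, per_class


acc, report = classification_report([0, 0, 1, 1, 2], [0, 1, 1, 1, 0])
print(acc)        # 0.6
print(report[0])  # (0.5, 0.5, 0.5)
```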
In testing, the model trained on Messidor reaches better precision. Nevertheless, the model trained on APTOS obtains better recall and F1-score. The ROC curves in Figure 2 show that the model trained on APTOS learned more useful features for most classes than the model trained on Messidor. Slight NPDR was the most difficult class, with an almost identical area under the ROC curve for both training datasets. A particular issue in this classification task is the lack of a standardized benchmark dataset that would establish baseline results to serve as a reference for new proposals. The Messidor dataset has been used in previous work and presents a challenging data distribution, making it suitable as a benchmark dataset. In contrast, the APTOS dataset presents a more varied data distribution, suitable for optimizing the model's parameters. Considering the images in Figure 3 and the values in Table 3, the standardized images from the Messidor dataset expose better features for each of the 4 DR-related outputs. However, the Messidor data distribution is not sufficient to generalize to different image representations.
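For a multiclass model, a per-class ROC curve is usually obtained one-vs-rest, scoring each image by the softmax probability of that class. The sketch below computes the area under that curve through its rank interpretation, the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as half. The function names are hypothetical, and the pairwise loop is quadratic, so it is meant for illustration rather than for large test sets, where a sort-based implementation is preferable.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise ranking (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probs, y_true, cls):
    """AUC for class `cls`, using its softmax probability as the score."""
    scores = [row[cls] for row in probs]
    labels = [y == cls for y in y_true]
    return roc_auc(scores, labels)


print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.75
```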

Discussion
A baseline comparison was obtained using the work presented by Gadekallu et al. [8], whose model has a single output for a healthy or DR-affected eye. To make the comparison suitable, we merged our 4 DR-related outputs into one unit. Validation results are used for the base comparison because the authors do not specify the test dataset used to obtain their metrics; we therefore assumed that their reported values correspond to the validation subset, which comprises 20% of the first version of the Messidor dataset. Nevertheless, the models cannot be compared directly because of data differences and the lack of information about which label the detailed metrics refer to. The model achieved a testing accuracy of 61% on both datasets. The benchmark metrics are presented in Table 4, where AM and MA denote the testing metrics on Messidor for the model trained on APTOS, and vice versa, respectively. The model trained on APTOS shows good precision for the unhealthy (DR-affected) eye and good recall and F1-score for the healthy label. However, the model trained on Messidor better recognizes the features of a given DR grade in fundus oculi images.
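The text does not state exactly how the 4 DR-related outputs were merged into one unit. One plausible reading, assumed in this sketch, is to sum their softmax probabilities, so that P(unhealthy) = P(Slight) + P(Medium) + P(Severe) + P(PDR), and to compare the result with P(No DR); ground-truth grades are collapsed in the same way, matching the Healthy/Unhealthy split described for Table 4. The function names are hypothetical.

```python
def to_binary(probs):
    """Merge a 5-class softmax output into (P(healthy), P(unhealthy)).

    Assumption: index 0 is No DR and the four DR grades are summed.
    """
    return probs[0], sum(probs[1:])


def predict_binary(probs):
    """Return 1 for unhealthy (any DR grade) and 0 for healthy."""
    healthy, unhealthy = to_binary(probs)
    return int(unhealthy > healthy)


def binary_label(grade):
    """Collapse a 0-4 ground-truth grade into the healthy/unhealthy label."""
    return 0 if grade == 0 else 1


probs = [0.4, 0.1, 0.2, 0.2, 0.1]
print(predict_binary(probs))  # 1
```

The merging rule matters: for the output above, a five-class argmax selects No DR (0.4 is the largest single probability), while summing the DR-related probabilities (0.6) labels the eye as unhealthy.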

Conclusions
In this work, we presented a model to detect diabetic retinopathy at early stages as an auxiliary diagnostic tool. The Messidor and APTOS datasets helped check the model's feature acquisition for different DR stages directly from examination images. Our proposal performed well on both unbalanced datasets, with validation accuracies of 81% and 64% for APTOS and Messidor, respectively. A fair comparison of this work with previous methods is still missing, given the lack of a standardized test dataset and evaluation methodology for classifying DR subclasses. A standardized dataset is necessary for two main reasons. The first concerns the dataset sources and the institution that manages them and ensures label integrity. The second concerns the data used to compute validation metrics: each work uses a different validation split, so different models are not evaluated on the same data. Additionally, many data competitions have no public test labels, making it difficult to compute test metrics. Nevertheless, we showed that the APTOS dataset has a more varied distribution of images, which helps optimize the model to acquire better features for classifying DR subclasses.
In future work, we propose creating a benchmark dataset to serve as a baseline for comparison. We also plan to evaluate different preprocessing functions and architectures, such as a multilabel approach, to improve model performance. The multilabel approach is motivated by the idea that the different NPDR grades share features. Moreover, a detection algorithm with bounding boxes will also be studied.

Figure 1. DenseNet architecture with three dense blocks (DenseNet121 itself uses four). Layers between two adjacent blocks are transition layers, which change feature-map sizes via convolution and pooling. Adapted from [11].

Figure 3. Dataset image samples from Messidor (left) and APTOS (right), both used in this work in a cross-testing method. From top to bottom, there is one sample per class: the first is a healthy eye, the middle rows show the nonproliferative diabetic retinopathy subcategories, and the last shows a proliferative diabetic retinopathy sample.

Table 1. Quantity of images per class in each dataset.

Table 2. Metrics obtained during validation. The APTOS dataset performed better during validation in all presented metrics. Bold values indicate the best performance for a specific class between both datasets.

Table 3. Metrics obtained during testing. DS columns indicate the datasets used for training and testing, where Mess means Messidor. Bold values indicate the best performance for a specific class between both datasets.

Table 4. Metrics obtained during testing as binary classification. DS columns indicate the datasets used for training and testing, in that order; A means APTOS and M means Messidor. Bold values indicate the best performance for a specific class between both datasets. The Healthy class comprises only the No DR class, while Unhealthy comprises the remaining ones, to allow a suitable comparison between our proposal and previous works.