Untangling Computer-Aided Diagnostic System for Screening Diabetic Retinopathy Based on Deep Learning Techniques

Diabetic Retinopathy (DR) is a predominant cause of visual impairment and loss. Approximately 285 million people worldwide are affected by diabetes, and one-third of these patients have symptoms of DR. It tends to affect patients who have had diabetes for 20 years or more, but its impact can be reduced by early detection and proper treatment. Manual diagnosis of DR is a time-consuming and expensive task that requires trained ophthalmologists to observe and evaluate digital fundus images of the retina. This study aims to systematically find and analyze high-quality research on the diagnosis of DR using deep learning approaches. It covers the DR grading and staging protocols and presents a DR taxonomy. Furthermore, it identifies, compares, and investigates the deep learning-based algorithms, techniques, and methods for classifying DR stages. Various publicly available datasets used for deep learning have also been analyzed and presented to support a descriptive and empirical understanding of real-time DR applications. Our in-depth study shows that in the last few years there has been an increasing inclination towards deep learning approaches. The most used algorithms for DR classification are Convolutional Neural Networks (CNNs), used in 35% of the studies, Ensemble CNNs (ECNNs), used in 26%, and Deep Neural Networks (DNNs), used in 13%. Thus, deep learning algorithms for DR diagnostics have future research potential for solutions based on early detection and prevention of DR.


Introduction
Diabetes, also called diabetes mellitus, is a disease that occurs when the level of glucose in the blood is too high. Glucose is the main source of energy and comes from the food that the human body consumes daily. Diabetes has become the cause of many health problems such as heart disease, stroke, nerve damage, foot problems, gum disease, and many more [1]. The eye is also one of the major organs affected by diabetes. The diabetic problem associated with the eye is called Diabetic Retinopathy (DR). A comparison of a normal eye and an eye with DR is illustrated in Figure 1, inspired by [2]. DR is the primary cause of blindness, mostly in adults [3]. It damages the retina, the light-sensitive part of the eye, and can cause blindness if it is not diagnosed in the early stages or not treated well. The chance of developing DR rises with the duration of diabetes: a patient with a 20-year history of diabetes has an 80% chance of encountering DR [4]. It is also notable that a patient with DR may have no symptoms or only a mild vision problem. Any damage or abnormal change in the tissue of an organ caused by disease is called a lesion. A consensus of international experts [5] formulated the International Clinical Diabetic Retinopathy (ICDR) and Diabetic Macular Edema (DME) Disease Severity Scales to standardize DR classification. ICDR improves coordination and communication among physicians caring for patients with diabetes [6,7].

Retinal Lesions
Microaneurysms (MA), hemorrhages (HM), soft exudates (SE), and hard exudates (HE) are frequently observed by ophthalmologists in retinal fundus images [8]. The presence of these lesions is a pathognomonic sign of Diabetic Retinopathy (DR).
(i) Microaneurysms: The earliest sign of DR is the presence of microaneurysms in the retina of a diabetic patient. They may appear as groups of tiny, dark red spots or look like tiny hemorrhages within the light-sensitive area of the retina. They vary in size, mostly between 10 and 100 microns, and are smaller than 150 microns [8-10]. In the early stage, they do not threaten eyesight but need to be diagnosed and treated early.
(ii) Hemorrhages: The primary symptom of DR is leakage of blood in the retina, which may appear anywhere in the retina and in any shape and size [11]. There are many types of hemorrhage, such as flame hemorrhages and dot and blot hemorrhages.
(iii) Soft Exudates: Soft exudates are also known as cotton wool spots [12]. They are often pale yellow or buttery in color and round or oval in shape, and they result from capillary occlusions that cause permanent damage to the function of the retina.
(iv) Hard Exudates: This is the most advanced abnormality of DR. Hard exudates range in size from tiny fragments to larger patches and cause visual impairment by blocking light from reaching the retina [8]. Figure 2 demonstrates the lesions in retinal images.
Automatic detection of these lesions could help in early DR monitoring and in controlling its progression in an efficient and optimal way. For many decades, researchers have worked towards building an effective method that identifies different abnormalities with high accuracy. However, extracting accurate features is a challenging task. Moreover, selecting suitable machine learning (ML) algorithms at an early stage leads to better results in early DR detection and can thus be considered the foremost step towards pervasive solutions. All algorithms have their pros and cons, and no method can be regarded as superior in all stages [13]. Recently, deep learning (DL) networks have attracted great attention due to their ability to extract features automatically and to learn strong abstract representations. Like machine learning, this paradigm is based on learning from data, but instead of using handcrafted features, which is a very difficult and time-consuming task, DL offers advantages such as improved performance on the training data and the elimination of the clumsy feature extraction process.
With the huge development of computing power and digital imaging, algorithms for detecting the severity of DR have improved over the past 5 years. A significant amount of work has already been done on automatic DR detection using fundus photographs, but only one review of this work has been produced to date [14]. Previous studies and reviews have discussed traditional methods of DR detection [9,15,16], but these methods require expert knowledge to hand-engineer features, and this highly sensitive task involves several parameter settings. The recent era of computer-based medical applications and solutions has provided a huge amount of data and the hardware capabilities to process these datasets. For instance, rapid advances in graphics processing units (GPUs) have encouraged many researchers to apply computer vision and image processing to achieve optimal performance. These approaches have gained the interest of many researchers and market contributors over traditional methods. The work completed in that review is divided into three main sections: (i) an overview of different tasks to diagnose DR, (ii) a summary of publicly available datasets, and (iii) an outline of DL methods for blood vessel and optic disc segmentation and for the detection and classification of various DR lesions. However, that review does not cover the questions we address in this study. A comparison of these studies has been presented in Table 1.

Purpose of This Study
This article presents a descriptive review of significant work on detecting DR by classifying its different stages, for instance mild, severe, and PDR, using deep learning techniques. The study also covers the public as well as private datasets used in the primary selected studies. Although several literature reviews on deep learning in DR detection have been presented in the past [14,19,20], some limitations have been found in these studies; these are addressed in this study.
Contributions: This study is conducted systematically and follows the practice of the Cochrane methodology to synthesize the evidence by randomized trials for addressing the research questions [21]. The primary aim of this study is to provide the following state-of-the-art features:
• A comprehensive overview of DR grading protocols and taxonomy to provide a general understanding of DR.
• Identification of the need for automatic detection of DR and of how deep learning can ease the traditional detection process for early detection and faster recovery.
• A meta-analysis of 61 research studies addressing deep learning techniques for DR classification problems, with a quality assessment of these studies.
• Identification of research gaps in the body of knowledge in order to recommend research directions that address these gaps.
To the best of our knowledge, this study is the first work that focuses on preparing a taxonomy of DR problems and identifying the deep learning techniques for DR classification, based on the grading protocols, that give efficient and highly accurate results.
The rest of the paper is structured as follows: Section 2 outlines the methodology used to conduct this study. It includes three phases: planning, conducting, and documenting; the classification criteria used to categorize studies according to scoring criteria are also part of this section. Section 3 describes the mapping study and its findings, covering the most used deep learning algorithms as well as the publicly available datasets for diagnosing DR stages. Section 4 contains the discussion and explains the validity threats that could affect this study. The last section, Section 6, provides a conclusion with suggestions and future work.

Systematic Literature Review Protocol
We have developed a research protocol for identifying, planning, synthesizing, and screening all the published work relevant to the research questions of this review. The main idea of this review is to discover the body of knowledge on using deep learning in DR. For this review, we have adopted the guidelines of the Cochrane methodology as a reference [21]. According to the author, a review is composed of three basic steps: (i) planning, (ii) conducting, and (iii) documenting the review. Thus, we have developed a research protocol inspired by this methodology, presented in Figure 3. The main objective of our study is to explore well-reputed digital libraries, using search terms, in pursuit of answers to the research questions.

Planning Phase
The initial phase of the study is to devise the protocol for review in a rigorous and well-described manner to preserve the primary rationale of the study. In this phase, we have prepared the review procedure from identification, evaluation, and interpretation of all available resources for deep learning in DR. Moreover, research objectives, research questions, and searching processes have been the main segments of this part of the research.

Research Objectives
The objectives of building this study are:

Research Questions
In order to gain an in-depth view of the domain, the Research Questions (RQ) are presented in Table 2 with their motivations. These questions will help us classify the current research on detecting DR by deep learning and also provide future directions for the research domain.

RQ1
What clinical features of retinal image are required for DR detection and classification, and which deep learning methods are mostly used to classify DR problems?
To answer this question, an overview of the current trends in deep learning in DR classification has been reviewed. The answer to this research question will help researchers in selecting the best deep learning technique to use as a baseline in their research.

RQ2
Which DR datasets have been acquired, managed, and classified to identify several stages of DR?
Identifying the available datasets will help researchers use them as benchmarks and compare performance with their own work.

RQ3
What are the open issues and challenges of using deep learning methods to detect DR?
This question will allow researchers to recognize open research challenges and future directions in order to detect DR by using more advanced deep learning techniques.

Search Process and String
A well-managed plan is needed to prepare the searching process and narrow down the results to focused solutions and answers. For an effective approach, we have executed automatic and manual searching strategies on well-reputed repositories such as IEEE Xplore, Science Direct, arXiv, ResearchGate, ACM, LANCET, PLOS, PubMed, and SpringerLink. After a close and rigorous review, the following search string, demonstrated in Figure 4, was prepared for automatic retrieval of studies from the selected search engines:
("diabetic" AND "retinopathy") AND ("Computer" OR "Automate*" OR "assisted" OR "early detect*" OR "identifi*" OR "screening") AND ("deep" OR "learning" OR "neural network")
However, to retrieve the most relevant research papers, we first performed a primary retrieval on the selected search engines to check that the results were adequate. Furthermore, for optimal retrieval, the search terms need to be bounded, because many terms can distort the results owing to the broader term machine learning as a base. Therefore, the coverage of the term needs to be limited to deep learning only. The search conditions for the searching process are:
• Retrieval of relevant keywords in deep learning for DR that can satisfy all research questions of this study;
• Identification of domain synonyms and related terms;
• Formulation of a state-of-the-art search string consisting of key and substantial terms combined with the "AND" or "OR" Boolean operators.
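The Boolean structure of the search string can be illustrated with a minimal sketch that assembles it from three term groups. The terms are copied from the search string in the text; the grouping names and helper functions are hypothetical and only serve the illustration.

```python
# Term groups copied from the search string in the text.
# The variable and function names are hypothetical.
TOPIC_TERMS = ["diabetic", "retinopathy"]
TASK_TERMS = ["Computer", "Automate*", "assisted",
              "early detect*", "identifi*", "screening"]
METHOD_TERMS = ["deep", "learning", "neural network"]


def group(terms, operator):
    """Quote each term and join the terms with one Boolean operator."""
    quoted = ['"{}"'.format(term) for term in terms]
    return "(" + " {} ".format(operator).join(quoted) + ")"


def build_search_string():
    """Combine the three groups with AND, as in the text."""
    return " AND ".join([
        group(TOPIC_TERMS, "AND"),
        group(TASK_TERMS, "OR"),
        group(METHOD_TERMS, "OR"),
    ])


search_string = build_search_string()
print(search_string)
```

Keeping the groups as separate lists makes it easy to add a synonym to one group without touching the Boolean structure of the others.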

Conducting Phase
The second phase of the research methodology of this study is the execution of the search. It deals with the collection of all the existing research papers and studies in deep learning for DR. However, before the systematic literature review, we have created the inclusion and exclusion criteria for the study with data extraction and synthesis strategy.

Inclusion/ Exclusion Criteria
The preliminary step after selecting all papers from well-reputed repositories and search engines is to eliminate redundant titles and all irrelevant papers. Based on Table 3, retrieved papers that meet any of the exclusion criteria have been excluded.

Study Selection
This study aimed to identify all papers that were most significant to the objectives of this systematic review. To avoid redundant inclusion, we ensured that each paper was considered only once, even when it was retrieved from several search engines or repositories. Every selected paper was reviewed based on its title, abstract, keywords, and conclusion. Table 4 gives the number of search results found in the digital libraries and the number of research papers selected after applying the inclusion/exclusion criteria.

Quality Assessment Criteria
In the final set, each publication was assessed for its quality. The quality assessment was performed during the data extraction phase and ensures that each selected study makes a valuable contribution to this review. For this purpose, a questionnaire inspired by [22] was designed. Its criteria are labeled (a), (b), (c), (d), and (e). The final score of each paper is calculated by assigning an individual score for each criterion and summing these scores. The quality assessment and scoring criteria are presented in Table 5.
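The cumulative scoring can be sketched as a simple sum over the five criteria. This is only an illustration: the per-criterion values below are hypothetical placeholders, since the actual scoring scale is the one defined in Table 5.

```python
# Criteria labels as given in the text.
CRITERIA = ("a", "b", "c", "d", "e")


def total_quality_score(scores):
    """Sum the individual criterion scores of one paper.

    `scores` maps each criterion label to its score. The admissible
    score values come from Table 5 and are not reproduced here.
    """
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError("missing criteria: {}".format(missing))
    return sum(scores[c] for c in CRITERIA)


# Hypothetical scores for a single paper, for illustration only.
example_scores = {"a": 1.0, "b": 0.5, "c": 1.0, "d": 0.0, "e": 0.5}
total = total_quality_score(example_scores)
print(total)
```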

Data Extraction Strategy and Synthesis Method
The data extraction strategy is defined in Table A1 in Appendix A; it collects the information required to address the research questions defined in this systematic mapping study. Data extraction IDs DE1 to DE3 collect the basic information about the papers. These data extraction features include study identifier, publication type, source of the publication, and publication location. The remaining IDs, DE4 to DE6, were extracted after studying the papers. Figure 5 shows the step-by-step procedure for finding the relevant studies. We retrieved a total of 5120 studies after searching the selected digital libraries. In the first step, we eliminated studies that were not relevant to the research questions listed earlier and also removed duplicates. After the initial step, rigorous manual screening based on the title, abstract, and content of the initially screened studies was performed. Excluding the studies that focused only on detecting lesions of diabetic retinopathy removed 755 studies. In the last step, only 61 studies were selected for further consideration. Importantly, 90% of the selected studies were published in JCR-listed journals and 10% in conference proceedings. Figure 6 presents the year-wise publication status, with 16% of papers selected from 2016 and 2017, 10% from 2018, 31% from 2019, and 20% from 2021.

Classification Criteria
The classification criteria were divided into categories that were established through selected primary studies as below.

Documenting Phase
After conducting the review, we presented the analysis of the results in a comprehensive format and also developed the meta-analysis in Tables 6-9 in order to present the information in a precise, comprehensive, and easy-to-understand format.

Mapping Results and Findings
The stepwise process of examination, review, and extraction is presented in Figures 5 and A1. These figures show the step-by-step observation and assessment of the selected studies. After close examination, six more papers were eliminated because they were less relevant to the research questions of this study. Finally, 61 studies were selected after automatic and manual screening and evaluation. This section aims to review, analyze, and present the selected studies to answer the research questions listed in Table 2. Each research question is addressed individually along with an in-depth review. Tables 10 and 11 present the quality assessment of each selected paper with respect to the quality assessment criteria in Table 5. The results show that this area of research has potential for future enhancement and optimization of health care solutions involving image processing and deep learning. To answer RQ1, the deep learning techniques used to detect DR in its early stages have been reviewed. Furthermore, architectures, tools, libraries, and performance metrics have been analyzed and reviewed. However, before describing the deep learning methods, there is a need to understand the clinical features and taxonomy required to detect DR. In the subsection below, we describe some of the important DR clinical features and also present the lesions and DR grading protocols. Furthermore, the taxonomy of DR detection classification is demonstrated in
In no DR, the eye is free of any problem caused by DR. In mild diabetic retinopathy, the retina has balloon-like swellings called micro-aneurysms; this is the early stage of the disease.
As the disease progresses, it turns into a more severe stage called moderate diabetic retinopathy, in which the blood vessels nourishing the retina may swell and distort, losing their ability to transport blood and changing the appearance of the retina. In severe DR, many blood vessels are blocked, cutting off the blood supply to areas of the retina.
The most advanced stage of DR is Proliferative Diabetic Retinopathy (PDR), which may cause sudden blindness due to vitreous hemorrhages or central hemorrhages. If DR is not detected and treated properly in its severe stage, it could turn into PDR [13]. Different stages of DR are shown in Figure 7, and Figure 8 presents the taxonomy of DR with respect to the severity of the problem. The summary of diabetic retinopathy grading protocols is given in Table 12, and the nomenclature is presented in Abbreviations.
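For classification experiments, the five severity levels described above are typically encoded as ordered class labels. The sketch below is one possible encoding, not the encoding of any particular study: the integer codes 0 to 4 and the default referable threshold (moderate or worse) are assumptions made for illustration.

```python
from enum import IntEnum


class DRStage(IntEnum):
    """Five DR severity levels; the integer codes are an assumption."""
    NO_DR = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3
    PDR = 4  # Proliferative Diabetic Retinopathy


def is_referable(stage, threshold=DRStage.MODERATE):
    """Binary view of the grading: True if the stage is at or above
    the threshold. The default threshold is an assumed convention."""
    return stage >= threshold


for stage in DRStage:
    print(stage.name, is_referable(stage))
```

Using an ordered enumeration keeps the severity order explicit, so a five-class grading can be collapsed into a binary referable or non-referable label with a single comparison.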

Clinical Features of Retinal Image for DR Detection

(1) Retinal Blood Vessel Classification
Retinal blood vessels form a complex anatomical structure of the human eye that is observable non-invasively. The structure of these vessels and its changes reflect the DR stage and help in understanding the progression of the disease. Moreover, a cataract is a cloudy, thick area of the eye. It starts with a transformation of proteins in the eye that makes the image of any object foggy or frosty and also causes pain. It is graded into four classes: normal, mild, moderate, and severe [82]. Based on the clinical understanding of cataracts, vessel segmentation and classification is tedious work because of several vessel features such as the length, width, position, and tortuosity of branches. Therefore, automatic vessel segmentation and classification is necessary for early DR detection and recovery. Several deep learning studies have sought more accurate methods and models for retinal blood vessel classification. For instance, the Deep Supervision and Smoothness Regularization Network (DSSRN) was proposed for deep observation of the regular network of blood vessels. DSSRN was built on VGG16, and Conditional Random Fields were used to enforce smoothness of the results, giving higher accuracy but low sensitivity [84]. Patch-based ensemble models have been used on the DRIVE, STARE, and CHASE datasets with optimal accuracy and sensitivity. Moreover, several classification methods have been used for understanding the transfer of semantic knowledge between side output layers. The ResNet-101 and VGGNet CNN models were used in these experiments and achieved better accuracy on the STARE and DRIVE datasets. However, these models cannot be used for micro-vessel segmentation with wider angles [50,64]. In another study, a CNN model was integrated with 12 variations to differentiate vessel from non-vessel segments, and the DRIVE dataset was used for evaluation [84].
A DCNN was used on three datasets, DRIVE, STARE, and CHASE, for vessel segmentation with extensive pre-processing, i.e., normalization and data augmentation by geometric transformation [65]. Regression analysis was applied with a pre-trained VGG model on vessel images by modifying the VGG layers on the DRIVE and STARE datasets [85]. The performance analysis of all the proposed segmentation models is presented in Tables 6-9.

(2) Optic Disc Feature Segmentation
The optic disc feature can be extracted by considering the conversion and contrast as parameters and then developing the DR diagnostic algorithm. Importantly, localization and classification in optic disc segmentation are two well-known operations for detecting DR. During this survey, it was observed that boundary classification, edge detection, and orbital approximation are the major tasks in deep learning-based optic disc feature classification, and that patch-based CNN, multi-scale CNN, FCN, and RCNN are the most used deep learning methods for segmentation and localization. The detailed description and performance analysis of this feature is presented in Tables 6-9.

(3) Lesion Detection and Classification
Clinically, identification and classification of referable DR relies on information about lesions. However, state-of-the-art machines can identify DR without lesion information, and thus their predictions lack clinical interpretability despite their high accuracy. Recent enhancements in visualization methods have encouraged researchers to work intensely on DR detection, and generic heat maps and gradient models are a major contribution to lesion detection and classification to date. Still, the detection of multiple lesions has not yet achieved acceptable accuracy. During the study, we intensively reviewed lesion-based DR. In total, we retrieved 23 papers on DR lesion classification, covering micro-aneurysms (11 studies), hard exudates (9 studies), and macular edema and hemorrhages (3 studies). The details of these studies are described in Tables 6-9.

Deep Learning Methods for DR Problems

(1) Convolutional Neural Network (CNN)
A convolutional neural network (CNN) is a category of neural network heavily used in the field of computer vision. It consists of an input layer, hidden layers, and a fully connected output layer. It derives its name from the convolutional layer, which is part of the hidden layers. The hidden layers of a CNN typically consist of convolutional, pooling, normalization, and fully connected layers [54,86,87]. The convolutional layer is the main building block of any CNN architecture. CNN is the most common type of architecture used in detecting DR in its early stages due to its neural and hierarchical structure. Authors in [23,29,39,43,88] demonstrate the effectiveness of CNN for five-stage classification of DR, and authors in [39,40,44]  Experiments with a small fraction of data can achieve state-of-the-art performance in referable or non-referable DR classification.
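To make the roles of the convolutional and pooling layers concrete, the following is a minimal pure-Python sketch of one convolution, one ReLU activation, and one 2x2 max-pooling step on a toy image. It is not taken from any of the reviewed studies; the image, the kernel, and all function names are illustrative assumptions, and practical CNNs learn their kernels from data using a deep learning framework.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding.

    As in most deep learning frameworks, this is technically a
    cross-correlation (the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Replace negative responses with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 block."""
    return [
        [max(feature_map[i][j], feature_map[i][j + 1],
             feature_map[i + 1][j], feature_map[i + 1][j + 1])
         for j in range(0, len(feature_map[0]) - 1, 2)]
        for i in range(0, len(feature_map) - 1, 2)
    ]


# Toy 5x5 "image" with a vertical dark-to-bright edge (hypothetical data).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written vertical-edge kernel; a trained CNN would learn such weights.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = relu(conv2d_valid(image, kernel))  # 4x4 response map
pooled = max_pool_2x2(feature_map)               # 2x2 summary
print(feature_map)
print(pooled)
```

The convolution responds only where the edge is located, and pooling keeps that response while halving the spatial resolution, which is the hierarchical behavior the text attributes to CNNs.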
(2) Deep Neural Network (DNN)
DNNs have also shown remarkable performance in many computer vision tasks. The term "deep" generally refers to the use of several convolutional layers in a network, and deep learning refers to the methodologies used to train the model to learn essential parameters automatically from data representations in order to solve problems in a specific domain of interest [10,55,72,90]. The major benefit of this algorithm is that classification accuracy increases as the number of samples in the training set increases. This algorithm has been used in [25,30,35,44,46,49,51] to improve classification accuracy among different DR stages. In [40], the author builds a DCNN model with three fully connected layers to automatically learn the local features of the image and to generate a classification model.

(3) Ensemble Convolutional Neural Network
Combining two or more models to obtain better training results is not a new technique in machine learning; for example, a decision tree model can be upgraded to a random forest model. In deep learning, an ensemble of neural networks is built by training several networks with the same configuration, each with randomly assigned initial weights, on a dataset. Every model makes its own predictions, and the final prediction is calculated by taking the average of all predictions. The number of models in an ensemble is kept small, such as three, five, or ten trained models, because of the computational cost and the diminishing returns in performance. Ensemble techniques have been used to improve classification accuracy and model performance [14,26,32,37,38,50,53,55,63].
Based on the primary selected studies, some other DL methods have been applied to detect DR using different datasets. These algorithms are graph neural network, Hopfield neural network, deep multiple instance learning, BNCNN, GoogleNet neural network, and densely connected neural network. Figure 10 helps us to answer RQ1. The chart shows that researchers mostly prefer the CNN architecture (31%) for detecting complex stages of DR. Deep CNN (20%) has also gained good attention and results. Based on the final selected studies, three major algorithms, CNN, DCNN, and ensemble CNN, were commonly implemented for DR detection. Moreover, for red-eye lesion detection, YOLOv3 has also been proposed [76] by using CNN and DCNN. Deep belief networks, Recurrent Neural Networks (RNN), autoencoder-based methods, and stacked autoencoder methods have also been used for DR classification problems. In addition, some researchers [26-28,32,37,46,63] used ensemble CNN variations to achieve better performance and results.
The author in [35] used the AdaBoost algorithm to combine ensemble CNN models and reported an optimization with higher accuracy. Moreover, the authors in [40,50,53,60,61,73,80,91,92] combine two or more CNN models to strengthen the training process for detecting DR. The in-depth analysis, with the major contribution of each selected study, is presented in Tables 9 and 10.
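The prediction-averaging step of an ensemble can be sketched in a few lines. The sketch assumes that each trained model outputs a probability vector over the five DR stages for one image; the three probability vectors below are hypothetical values, not results from any reviewed study.

```python
def average_predictions(model_probs):
    """Average the class-probability vectors of several models."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [
        sum(probs[c] for probs in model_probs) / n_models
        for c in range(n_classes)
    ]


def predict_class(model_probs):
    """Return the index of the class with the highest averaged probability."""
    averaged = average_predictions(model_probs)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Hypothetical outputs of three models for one fundus image,
# ordered as (no DR, mild, moderate, severe, PDR).
model_probs = [
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.20, 0.30, 0.40, 0.05, 0.05],
    [0.10, 0.50, 0.30, 0.05, 0.05],
]

averaged = average_predictions(model_probs)
predicted = predict_class(model_probs)
print(averaged, predicted)
```

In this example the second model alone would predict the moderate class, but the averaged vector favors the mild class, which shows how averaging can smooth out the disagreement of a single model.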

Research Question 2: Which DR Datasets Have Been Acquired, Managed, and Classified to Identify Several Stages of DR?
To answer RQ2, we thoroughly studied each final selected study. Each of the 61 studies used different sets of private or publicly available data to validate its results. This section starts by reviewing the most popular publicly available datasets for the task of detecting different severity levels of DR. Thereafter, we focus on the datasets that were used by the researchers in the selected studies. Figure 11 shows the rapid increase in DR datasets for classification and segmentation [87].

Publicly Available Datasets
Many datasets are publicly available online for eye conditions such as glaucoma, age-related macular degeneration, and diabetic retinopathy. These publicly available datasets aim to provide good retinal images for training and testing purposes. They are very important for the advancement of science, as they lead to the development of better algorithms and help technology transfer from research laboratories to clinical practice. The descriptions of these datasets are listed in Table 13. Moreover, there are several reasons for providing datasets publicly.
(1) Sharing and availability of data will encourage researchers to process and analyze these datasets and thus lead to better solutions in DR-based applications. Diversity and availability of datasets can produce remarkable results. In previous studies, it has been observed that, due to the unavailability of relevant data, many solutions were not able to achieve their goals. However, these solutions can achieve better results when paired with relevant and updated datasets.
(2) Almost all research in the same area of study requires the same type of data; if these datasets are publicly available, there is no need to collect the data again, which is beneficial in economic and financial terms and saves time and effort.
(3) Publicly available data and their re-processing and re-use for diverse analyses are fundamental resources for innovation. Research and experiments on these datasets can provide new insights and opportunities for researchers to collaborate with medical practitioners on data-driven applications. Moreover, these datasets will statistically strengthen the solutions and encourage multidisciplinary researchers to contribute new analyses using their expertise, thus leading to comparative analyses and optimal solutions.
(i) DRIVE Dataset: Digital Retinal Images for Vessel Extraction (DRIVE) is a publicly available dataset of 40 color fundus images intended for automated vessel detection algorithms. The DRIVE images were obtained from a DR screening program in the Netherlands. The screening population consisted of 400 patients aged 25-90 years. From these data, 40 images were selected randomly; 33 images showed no signs of diabetic retinopathy and 7 images showed early signs of DR such as exudates and hemorrhages. Each image is JPEG-compressed. All images were acquired using a Canon CR5 camera. The set of 40 images is divided equally into 20 images for training and 20 images for testing. For the training set, a single manual segmentation of the vasculature is available, and for the test set, two manual segmentations are available. Extracting or segmenting the blood vessels from a retinal image is a task that many researchers have tried to automate, as it is a very difficult and time-consuming process. The studies [37,64,65,67-70,73] used the DRIVE dataset for their research work.
(ii) STARE Dataset: Structured Analysis of the Retina (STARE) was initiated by Michael Goldbaum at the University of California. It contains 20 digitized images captured with a TopCon TRV 50. The STARE website provides 20 images for blood vessel detection with labeled ground truth and 81 images for optic disc localization with ground truth. The performance of vessel detection is measured using ROC curve analysis, where sensitivity is the proportion of correctly classified blood vessel pixels and specificity is the proportion of correctly classified normal pixels. For the optic disc, performance is measured as the number of correctly localized optic discs, and a localization counts as successful if the algorithm places the optic disc within 60 pixels of the ground truth. The studies [27,64,65,68,72] have used the STARE dataset for results validation.
(iii) MESSIDOR Dataset: Messidor is a large dataset publicly available online that contains 1200 color fundus images. It was acquired by three ophthalmology departments using a color video 3CCD camera. All images are stored in TIFF format. Eight hundred images were acquired with pupil dilation and the other 400 without dilation. Its primary aim is to support the evaluation and comparison of algorithms for analyzing diabetic retinopathy. The severity of DR is measured based on the presence and number of diabetic lesions and on their distance from the macula. Many studies, for instance [10,32,35,39,40,45,46,57-59,65,68,70,79,89], have used the Messidor dataset for deep learning and classification.
(iv) DIARETDB Dataset: This is a publicly available dataset for the detection of diabetic retinopathy from fundus images. The current dataset consists of 130 images, of which 20 show no sign of retinopathy and the remainder show signs such as exudates, hemorrhages, and micro-aneurysms. All images were captured using a 50-degree field-of-view fundus camera with unknown camera settings. The main aim of the dataset is to define the data and a testing protocol unambiguously, so that it can be used as a benchmark for diabetic retinopathy detection methods. DIARETDB has a further three levels: DIARETDB 1, DIARETDB 2, and DIARETDB. Refs. [32,60] used the DIARETDB dataset.
(v) EYEPACS Dataset: EyePACS consists of nearly 10,000 retinal images and provides a free platform for retinopathy screening. The images were taken under a variety of retinal conditions, and left and right fields are provided for every subject. A clinician rated each image according to the presence of diabetic retinopathy. The images were taken with different types and models of cameras. Due to the large size of the dataset, it has been divided into separate files with multi-part archives such as train.zip, test.zip, sample.zip, etc. The studies [10,40,46,58,65,68,70,78] have used this dataset in research.
(vi) KAGGLE Dataset: The dataset (Kaggle, 2015) by Kaggle consists of 82 GB of image files, which contain a total of 88,702 color fundus images. Each image was rated by an ophthalmologist for the presence of lesions, using severity levels from 0 to 4. Level 0 contains images with no sign of retinopathy and level 4 contains images with advanced stages of retinopathy. The training set is highly unbalanced, with 25,811 images for level 0, 2444 images for level 1, 5293 images for level 2, 874 images for level 3, and 709 images for level 4. The dataset also contains challenging images that are of poor quality or unfocused, which makes it difficult for any algorithm to classify them correctly according to the severity of retinopathy. Many researchers, for instance [10,22,26,32,35,40,44,46,53,77,90], have used the Kaggle dataset.
(vii) Others: Instead of using only publicly available datasets, some researchers [23,24,28-31,33,34,36,37,39,40,47,55,60,72,74,80,81,93] have preferred not to use public datasets to classify DR stages. A few researchers [40,50] have combined two datasets, one public and one private, to create a new dataset. To answer RQ3, section A provides detailed information about the existing datasets that were mostly used to detect two to five stages of DR. Figure 8 shows that, rather than using only public datasets, some researchers prefer to collect their own dataset (34%), either from screening programs or from different hospitals or research laboratories. The Kaggle dataset (29%) has also gained attention for detecting all five stages of DR. Messidor was used in 20% of the cases, EyePACS in 9%, and DIARETDB, STARE, and DRIVE in only 2%. Table 13 lists the description, format, and download links of some of the publicly available datasets.
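To make the imbalance of the Kaggle training set described in item (vi) concrete, the following minimal sketch computes the share of each severity level from the counts quoted there, together with inverse-frequency class weights. The weighting formula (total / (number of classes × class count)) is a common heuristic and is an assumption made for illustration; it is not taken from any of the reviewed studies.

```python
# Per-level image counts of the Kaggle training set, as quoted in item (vi).
counts = {0: 25811, 1: 2444, 2: 5293, 3: 874, 4: 709}

total = sum(counts.values())
n_classes = len(counts)

# Share of each severity level in the training set.
shares = {level: n / total for level, n in counts.items()}

# Inverse-frequency ("balanced") weights: total / (n_classes * count).
# Assumption for illustration only, not the method of the reviewed studies.
weights = {level: total / (n_classes * n) for level, n in counts.items()}

for level in sorted(counts):
    print(f"level {level}: {counts[level]:>6} images, "
          f"share {shares[level]:.3f}, weight {weights[level]:.2f}")
```

With these counts, level 0 alone accounts for roughly 73% of the training images, so a classifier that always predicts "no retinopathy" would already look accurate; weights of this kind are one way to counteract that during training.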

DR Detection Challenges
After thoroughly reviewing and processing the final selected papers, we have highlighted some of the main challenges faced by the researchers. Generally, these challenges include dataset bias, dataset quality, computational cost, dataset size, interpretability, and legal and ethical issues.
(i) Dataset Bias: The annotation work depends on the experience of the trained ophthalmologists or physicians and on how they grade the images [28,40,58,83]. As a result, the algorithm may perform differently when an image with missing variables (microaneurysms, exudates, and other important biomarkers of DR) that the majority of clinicians could not identify is fed into the network, which may lead to various errors in the results [20,30,42,46,94].
(ii) Dataset Quality Issue: The quality of a DR imaging dataset is mainly affected by how the dataset is collected, i.e., the quality of the camera, the type of the camera, and overexposure to light [33]. A low-quality camera may miss important information, and dark dots caused by dust on the camera may be misclassified as DR lesions [30,42,58].
(iii) Computational Cost: Deep neural networks are computationally intensive, multilayered algorithms consisting of millions of parameters. Therefore, the convergence of such algorithms requires more computational power as well as running time [29,38,95]. Although there is no strict rule about how much data is required to train a neural network optimally, empirical studies suggest that tenfold training data produces an effective model. Furthermore, training on a large number of high-quality images requires a powerful GPU, which is sometimes not cost-effective [48].
(iv) Dataset Size: Managing dataset size is also a common challenge faced by many researchers. Algorithms applied to a small dataset may not be able to accurately identify the severity grades of DR, and choosing too large a dataset may lead to over-fitting [29,31,46,48].
(v) Interpretability: The power of DL methods to map complex, non-linear functions makes them hard to interpret. Deep learning algorithms act like a black box in which the algorithm automatically extracts discriminative features from the images and associated grades [23,26,36]. Therefore, the specific features chosen by the algorithm are unknown. To date, understanding the features used by deep learning to perform its calculations is an active area of research [49,83].
(vi) Legal Issues: Medical misconduct rules govern standards of clinical practice to ensure that clinicians provide proper care to their patients. However, no standard rules have yet been established to assign blame in contexts where an algorithm provides bad predictions and poor recommendations for treatment. The creation of such rules is an essential prerequisite for the worldwide adoption of deep learning algorithms in the medical field.
(vii) Ethical Issues: Some DR imaging datasets are publicly accessible to researchers. However, retrieving and collecting private datasets without formal agreement raises several ethical issues, particularly when the data contain sensitive information about the patient.

Open Issues
Deep learning is a promising technique for diagnosing DR. However, researchers should take care when selecting the DL technique. DL models often require a large amount of data to train efficiently, and such data are often not publicly available or require permission from hospitals or research labs to access. This restricts the researchers who can work in this field to those based at large academic medical centers where such datasets are available, and for the most part excludes the core deep learning community, which has the algorithmic and theoretical background needed to advance the field. Moreover, because the training data strongly influence the measured performance of a model, it is currently almost impossible to compare the approaches proposed in the literature with each other if the training data are not shared publicly when the manuscript is published. Additionally, there is often no clear methodology for deciding how many layers should be added and which algorithm should be used to obtain appropriate results for detecting DR automatically. The setting of hyper-parameters during training also affects the performance of the model, and it is difficult to judge which setting to choose. Sometimes, after retraining, the parameters in the model cause a decrease in accuracy [39,94,96].

Discussion
This study reviewed deep learning techniques used to detect the five main stages of DR. A taxonomy of the severity levels of DR according to the number of lesions present in the retinal fundus image is shown in Figure 9. Because of the severe damage that DR causes, there is a pressing need for diagnostic tools that can diagnose DR automatically with less involvement of experts. This literature review applies a systematic approach to estimate the diagnostic accuracy of DL methods for DR classification in patients with long-term diabetes. Compared with previously tested machine learning techniques, an advantage of DL is a substantial reduction in the manpower required for extracting features by hand, as DL algorithms learn to extract features automatically. Moreover, incorporating DL algorithms might enhance the performance of automatic DR diagnostic systems.
One study achieves the highest Area Under the Curve (AUC) score of 100% [52]. All the mentioned studies report high performance, demonstrating the potential of DL approaches to detect DR efficiently, although it should be kept in mind that the included studies vary in target condition, in the dataset used for training, and in the reference standards. A comparison is possible between 6 of the 39 selected studies, as they use the same validation dataset (Kaggle) to detect five-stage severity levels of DR using CNNs [21,22,38,42,83].
Among these studies, one study [21] achieves the highest accuracy score of 80.8% on Kaggle, and [42] reaches 92.2% specificity and 80.28% sensitivity at a high-sensitivity set point. The accuracy of CNN models is regarded as a strong reference standard because most experts expect such models to diagnose the disease more accurately than a human grader. Nevertheless, some aspects are still changing, and it is difficult to say whether a high performance score is due to the algorithm or to a combination of multiple features. Besides the Kaggle dataset, ten studies used the Messidor-2 dataset to validate their results and achieve the highest performance scores.
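A "high-sensitivity set point" is an operating threshold on the model's output score chosen so that sensitivity reaches a target value, with specificity then reported at that same threshold. The sketch below shows one simple way to pick such a point for a binary task (DR present vs. absent). The function name, the toy labels and scores, and the target values are all hypothetical; this is not the procedure used in [42].

```python
def high_sensitivity_threshold(y_true, scores, target_sensitivity):
    """Return (threshold, sensitivity, specificity) for the highest score
    threshold whose sensitivity meets the target. A case is predicted
    positive when its score is >= threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")
    # Lowering the threshold can only raise sensitivity and lower specificity,
    # so the first threshold that meets the target keeps specificity highest.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < thr)
        sensitivity = tp / n_pos
        if sensitivity >= target_sensitivity:
            return thr, sensitivity, tn / n_neg
    raise ValueError("target sensitivity must be at most 1.0")


# Hypothetical toy data: 1 = DR present, 0 = no DR.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.05]

for target in (0.75, 1.0):
    thr, sens, spec = high_sensitivity_threshold(y_true, scores, target)
    print(f"target {target:.2f}: threshold {thr}, "
          f"sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Raising the sensitivity target moves the threshold down and costs specificity, which is why a sensitivity figure is only meaningful together with the specificity measured at the same set point.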
DL algorithms can be integrated into a screening program in numerous ways. Turning to the limitations of the primary studies, it is worth noting, first, that the DL models could only be validated using high-quality images and needed a huge number of images for training, e.g., from the Messidor-2, STARE, and DRIVE datasets. The images in a dataset are not necessarily of good quality; they may contain noise and distortion, which could lead to misclassification. Secondly, the majority of the included studies used privately collected datasets that are not publicly accessible, so other researchers could not validate the reported performance. Lastly, training on a large number of images requires a GPU, which is sometimes not cost-effective [22,28]. Given these limitations, improvements are needed before CAD tools can completely replace the manual system.
While our results rest on an evidence-based methodology, this study itself has some limitations. Firstly, the search strategy could have been more sensitive; the search string was run on search engines to collect studies belonging to JCR journals and CORE conferences. Moreover, applying a snow-balling approach to collect papers may have led to the inclusion of some more studies. Lastly, because the included studies are not comparable, it was difficult to predict how a DL algorithm would perform in the real world. The purpose of this study is to bring increased focus to the development of different DL algorithms and their performance in classifying the various stages of DR, as the included studies show great promise.
The focus of this study was only on detecting the different stages of DR using deep learning techniques, not on lesion detection, DME, or other retinal diseases. In the future, it would be interesting to see whether combining DR classification with other detection techniques improves the performance of the algorithm. Moreover, studies should report sensitivity, AUC, specificity, the confusion matrix, and false positive and false negative values, because these are important for assessing patient safety.
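As an illustration of the reporting recommended above, the following minimal sketch builds a five-grade confusion matrix and derives per-grade true/false positives and negatives, sensitivity, and specificity in a one-vs-rest fashion. The function names and the toy labels are hypothetical and serve only to show the arithmetic; AUC is omitted because it requires continuous scores rather than predicted grades.

```python
def confusion_matrix(y_true, y_pred, n_classes=5):
    """Rows are true grades, columns are predicted grades."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def one_vs_rest_counts(cm, grade):
    """Return (tp, fp, fn, tn), treating `grade` as the positive class."""
    total = sum(sum(row) for row in cm)
    tp = cm[grade][grade]
    fn = sum(cm[grade]) - tp                 # true grade, predicted as another
    fp = sum(row[grade] for row in cm) - tp  # other grade, predicted as this one
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def sensitivity_specificity(cm, grade):
    tp, fp, fn, tn = one_vs_rest_counts(cm, grade)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy labels on the 0-4 severity scale.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4]
y_pred = [0, 0, 0, 1, 1, 0, 2, 2, 4, 4]

cm = confusion_matrix(y_true, y_pred)
for grade in range(5):
    tp, fp, fn, tn = one_vs_rest_counts(cm, grade)
    sens, spec = sensitivity_specificity(cm, grade)
    print(f"grade {grade}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Reporting the false negatives per grade, rather than a single accuracy figure, makes it visible when a model misses the rarer, more severe grades.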
In conclusion, deep learning is currently the hottest topic (the selected studies show that the majority of the work was completed in 2019) and, based on diagnostic performance, the leading method for automatically classifying DR in patients with long-term diabetes. Even though CAD systems cannot completely outperform humans, we suggest taking advantage of deep learning to build semi-automated models for use in screening programs, reducing the burden on ophthalmologists.

Threats to Validity
This study was conducted in May 2021, so studies that appeared after that date were not captured. Reflecting on the methodology, this study has the following limitations.
Construct Validity: The search string was built using keywords intended to retrieve all relevant papers, but the addition of extra keywords might alter the final results. The search string was run in IEEE Xplore, ACM Digital Library, PLOS, ResearchGate, ScienceDirect, arXiv, PubMed, and Springer. These digital libraries were considered the major sources of data in our area of interest. We did not run the search string on Google Scholar. We believe that most of the studies on automatic DR detection can be found in these digital libraries, which cover all ranks of conferences and journals.
Internal validity: Internal validity deals with data analysis and the extraction process. Some studies have overlapping contributions and areas of focus; it was also noted that a study often addresses only a single component of the research area. For example, a previously published study is improved by adding one or more features [22,28]. In such cases, the study that claims the contribution in the area of focus was considered for categorization. In some studies, different titles were used for DR detection; after reading the whole paper, those studies were categorized based on the DR severity stages.
External validity: Papers from various search engines were examined and selected by the authors, who used the university's access to the databases to extract data, but some papers might have been missed due to limited access to digital libraries. This threat was managed by requesting the full article from the original authors, although some authors did not reply. While shortlisting the articles most relevant to the area of focus, some papers might have been discarded due to the presence of one or more exclusion criteria. A time limit was also applied to the search for published studies, so the representativeness of the selected studies might be affected.
Conclusion validity: Threats to conclusion validity were managed by clearly presenting each step of the systematic study. Care was taken during the study not to identify incorrect relationships that could lead to incorrect conclusions. In the authors' opinion, slight differences due to publication misclassification and selection bias would not change the main conclusions drawn from the 61 studies selected in our systematic study.

Future Directions
This study has shown that deep learning can be useful in DR detection and classification, but several aspects remain open for research. Many deep learning and advanced computation techniques have been used to address DR problems. However, due to the gap between the medical and technical fields, the interpretation and accuracy of these solutions with respect to clinical understanding are still under discussion. Therefore, the results of this study do not include clinical accuracy and effectiveness. Another challenge is designing deep-learning models, because these models require a huge amount of data. As described in the previous section, the availability of datasets is no longer the main problem; rather, image annotation is one of the major issues, because expert ophthalmologists are required to produce accurate annotations and thus lead to better deep-learning models. Data augmentation is also required for real-time DR datasets, and knowledge-based data augmentation features need to be developed for robust deep learning techniques. Moreover, image classification for DR problems requires a variety of fundus image samples, but the available datasets are not appropriate for every kind of DR problem. In particular, these images introduce a bias into DR classification; creating morphological lesion variation and extensive data augmentation methods for preserving and classifying different kinds of DR problems is therefore another research direction. The summary of DR-related research issues and their solutions is presented in Table 14.
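As a point of reference for the data augmentation mentioned above, the sketch below generates simple geometric variants (flips and 90-degree rotations) of an image stored as a nested list of pixel rows. It is a generic baseline under stated assumptions, not the knowledge-based augmentation called for here: the function names are hypothetical, and whether mirrored or rotated fundus images remain clinically valid for a given task is itself an assumption that would need to be checked.

```python
def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img):
    """Return the original image plus five geometric variants."""
    r90 = rot90(img)
    r180 = rot90(r90)
    r270 = rot90(r180)
    return [img, hflip(img), vflip(img), r90, r180, r270]


# Hypothetical 2x3 "image"; each entry stands in for one pixel value.
image = [[1, 2, 3],
         [4, 5, 6]]

for variant in augment(image):
    print(variant)
```

Each variant keeps the grade label of the original image, so one labeled sample yields six training samples; lesion-aware or morphology-aware augmentation would have to go beyond such purely geometric transforms.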

Conclusions
Diabetic retinopathy is the leading cause of severe eye conditions that can damage the retinal vessels and cause vision problems. Late detection of DR can seriously harm the retinal vessels and thus lead to blindness, whereas early detection can prevent the potential damage. Computational methods, including image processing, machine learning, and deep learning, have been used for early detection of DR. Deep learning has opened a new paradigm for designing and developing models that identify DR complications, including segmentation and classification. Numerous deep learning techniques have been implemented to detect the different stages of diabetic retinopathy. In this study, we have performed a review to accurately present the state of the art in deep learning methods for detecting and classifying DR. We have presented the taxonomy and DR grading schemes with an in-depth analysis of the related DR problems. Moreover, we have identified major challenges and accomplishments of the studies. Eight major publication portals were used to search for relevant articles. The analysis and review have presented the main DL techniques used for multi-class or binary classification of DR. Further, we have provided a thorough comparison of studies based on the architecture used, performance, and tools. We have also highlighted the primary datasets used by the research community according to dataset type, image type, and source. Lastly, we have provided insight into the open issues and challenges researchers face while detecting diabetic retinopathy automatically with DL methods, which also suggest potential future research directions in this area. Generally, deep learning methods have overtaken the traditional manual detection methods.
This study has provided a comprehensive understanding of the importance of, and future prospects for, using deep learning methods for early detection, which can help advise patients according to the severity of the problem. Of the reviewed studies, 35% used convolutional neural networks (CNNs), 26% implemented the ensemble CNN (ECNN), and 13% used Deep Neural Networks (DNN) for DR classification. An in-depth analysis of the algorithms used for DR classification has also been presented. This review gives a comprehensive view of state-of-the-art deep-learning-based methods for DR diagnosis and will help researchers to conduct further research on this problem.

Conflicts of Interest:
There is no conflict of interest for this research.

Abbreviations
The following abbreviations are used in this manuscript:

Figure A1. The obtained results from the execution of the study filtering process.