Transfer Learning with Convolutional Neural Networks for Diabetic Retinopathy Image Classification. A Review

Diabetic retinopathy (DR) is a dangerous eye condition that affects diabetic patients. Without early detection, it can affect the retina and may eventually cause permanent blindness. The early diagnosis of DR is crucial for its treatment. However, the diagnosis of DR is a very difficult process that requires an experienced ophthalmologist. A breakthrough in the field of artificial intelligence called deep learning can help in giving the ophthalmologist a second opinion regarding the classification of the DR by using an autonomous classifier. To accurately train a deep learning model to classify DR, an enormous number of images is required, and this is an important limitation in the DR domain. Transfer learning is a technique that can help in overcoming the scarcity of images. The main idea that is exploited by transfer learning is that a deep learning architecture, previously trained on non-medical images, can be fine-tuned to suit the DR dataset. This paper reviews research papers that focus on DR classification by using transfer learning to present the best existing methods to address this problem. This review can help future researchers identify existing transfer learning methods for the DR classification task and understand how they differ in terms of performance.


Introduction
Diabetes mellitus (DM) is a chronic, metabolic, clinically heterogeneous disorder whose prevalence has been increasing steadily all over the world [1]. It is estimated that 366 million people had DM in 2011; by 2030, this will have risen to 552 million [2]. DM is characterized by persistent hyperglycemia, which may be due to impaired insulin secretion, resistance to the peripheral actions of insulin, or both, which eventually leads to pancreatic beta-cell failure [3]. People living with DM are more vulnerable to various forms of both short- and long-term complications due to metabolic aberrations that can cause damage to various organ systems, leading to the development of disabling and life-threatening health complications, the most prominent of which are microvascular (retinopathy, nephropathy, and neuropathy) and macrovascular complications [4].
Diabetic retinopathy (DR) is one of the most common microvascular complications that is caused by DM, and it happens when the blood vessels inside the retina are affected by high blood sugar levels [5]. DR can create some irreversible complications that can lead to blindness in many cases. The number of patients that suffer from DR was estimated at 126.6 million in 2010, and this number is expected to grow to 191 million by 2030 [6]. More than 2.6% of blindness worldwide happens because of DR [7]. This percentage corresponds to a significant number of persons whose quality of life is severely affected. Though the early diagnosis of DR can help prevent blindness [8], this is a challenging task. In more detail, the main challenge of early DR detection is the workforce that is needed to examine the retina images [9], because diabetic patients must be assessed by an ophthalmologist at least once a year to detect the early signs of DR. Therefore, a reliable detection technology is needed to assist health care personnel in analyzing DR. According to Wilkinson et al. [10], DR can be classified into five grades: grade 0 is normal with no sign of DR, grade 1 means the presence of mild DR, grade 2 means moderate, grade 3 means severe, and, finally, grade 4 is defined by new vessel proliferation, where risks of vision loss include bleeding into the vitreous and tractional retinal detachment. Figure 1 shows the different grades of DR.
Deep learning belongs to the broad family of machine learning methods [11]. Unlike traditional neural network-based classifiers, deep learning builds classifiers with many hidden layers, aiming at identifying the salient low-level features of an image [12]. In the context of deep learning, transfer learning is a technique that exploits the usage of features that were learned by a network over a given problem to solve a different problem in the same domain. Transfer learning has many advantages. First, it saves computational time because, instead of training a new model from scratch, it makes use of the information that is already available from the last training process. Second, it extends the knowledge it acquired from previous models, and third, transfer learning is very useful when the size of the new training dataset is small. Transfer learning promises valuable contributions to the fields of computer vision, audio classification, and natural language processing.
There have been many attempts to automate the image classification task, either to facilitate the process or to make it more accurate. One of the earliest attempts was the convolutional neural network (CNN), which was introduced by [13] for the image classification task.
In 2012, thanks to the work of Krizhevsky et al. [14], CNNs became the most popular technique for addressing the image classification problem. The authors achieved state-of-the-art performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competition [15], outperforming other commonly used machine learning techniques. CNNs can be used in image classification, as well as natural language processing [16][17][18] and time series analysis [19,20]. In all of these cases, training the weights of the deep network from scratch requires a substantial amount of time and huge datasets (hundreds of thousands of images). These requirements make deep learning algorithms very challenging in the context of medical images where, typically, only a limited number of images are available. A lot of time and experience are required to annotate medical images, and that is where transfer learning can play a significant role: It allows for the use of a pre-trained architecture that was previously fitted to images of the same domain.
Thus, transfer learning is particularly suitable for addressing the DR classification domain, where there is a lack of images to accurately train a CNN from scratch.
Several studies have been done to classify DR by using CNNs, either by using transfer learning or by introducing novel architectures [21][22][23][24][25], but to the best of our knowledge, there have not been any reviews that survey the existing transfer learning techniques to classify DR images. To answer this call, in this paper, we discuss state-of-the-art DR image classification models that use the transfer learning of deep CNNs. Moreover, we discuss some important open questions to better apply transfer learning in the DR domain.
In more detail, we discuss state-of-the-art models and techniques that were published from 2015 to mid-2019. We used the following descriptors: "diabetic retinopathy," "convolutional neural networks," "transfer learning," and "image classification" to cover the primary studies that address the classification of DR images by using transfer learning. These keywords were entered into the most well-known academic databases, namely Scopus and PubMed.
Two filters were used to produce the results: The first filter excluded any paper that was not about DR, which reduced the results from 172 papers to 31 papers; the second filter excluded any paper that was not about transfer learning, which resulted in 18 papers that were about transfer learning applied to DR. This paper is organized as follows: Section II gives an overview of a CNN structure. Section III discusses various CNN architectures that are commonly used in transfer learning. Section IV provides a brief description of the main DR datasets that are available for public use. Section V provides a review of papers on the usage of transfer learning in classifying DR. Section VI presents the discussion, while Section VII presents open research questions. Finally, Section VIII concludes the paper.

Convolutional Neural Networks and Transfer Learning
CNN layers can be classified into two categories: primary layers and secondary layers. The primary layers are the main layers that are used in the CNN and consist of convolution layers, activation layers, pooling layers, flatten layers, and dense layers. Secondary layers are optional layers that can be added to make CNNs more robust against overfitting and increase their generalizability. They include dropout layers, batch normalization layers, and regularization layers. Figure 2 shows a CNN structure.

Convolution Layers
The first and most important layer in a CNN is the convolution layer, which can automatically extract the image features without the need to manually define these features. The convolution layer can be defined mathematically by:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \qquad (1)$$

where the convolution is the integral of the pointwise multiplication of two functions after one of them has been reversed and shifted [26]. From Equation (1), $g$ is the filter that is used. It is reversed and then slides along $f$, where $f$ is the input function. The area of the intersection between the two functions, $f$ and $g$, is the convolution value. In a CNN, the filters are not reversed but instead used as-is. The filter used, $g$, can be expressed as a grid of order $k \times k$. Usually, the numbers inside the filter are initialized randomly, and then these numbers are learned during the training process of the network. The result of the pointwise multiplication between the filter $g$ and the input function $f$ is saved in a new matrix called the output feature map. Figure 3 represents the differences between the convolution, the filter, and the output feature map. The steps that are performed by the filter function over the input function define the stride parameter. The stride can be formally defined as the amount by which the filter function $g$ moves at each step over the input function $f$. Usually, after the convolution operation, the output feature map will have smaller dimensions than the input function. To prevent the output map from shrinking, one can rely on padding, a technique that adds zeros around the input signal. Padding can be defined as the number of zeros that are added to the input function to control the spatial size of the output feature map throughout a network, especially deep networks. Figure 4 represents an input function with zero-padding.
The convolution operation output depends on the input size, the filter size, the stride, and the padding. The output feature map size is calculated as follows:

$$o = \left\lfloor \frac{n - k + 2p}{s} \right\rfloor + 1$$

where the filter size is $k$, the input dimension is $n$, the stride is $s$, and the padding is $p$.
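The output-size rule and the sliding-filter operation described above can be illustrated with a minimal pure-Python sketch. The function names `conv_output_size` and `conv2d`, the symbol names (`n` for the input dimension, `k` for the filter size, `s` for the stride, `p` for the padding), and the toy image and kernel are assumptions made for illustration only. As stated in the text, the filter is applied without being reversed.

```python
def conv_output_size(n, k, s=1, p=0):
    """Size of the output feature map along one dimension."""
    return (n - k + 2 * p) // s + 1


def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` without flipping it, as CNNs do."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    # Zero-padding: surround the input with `padding` rows and columns of zeros.
    padded = [[0.0] * (n_cols + 2 * padding) for _ in range(n_rows + 2 * padding)]
    for r in range(n_rows):
        for c in range(n_cols):
            padded[r + padding][c + padding] = image[r][c]
    out_rows = conv_output_size(n_rows, k_rows, stride, padding)
    out_cols = conv_output_size(n_cols, k_cols, stride, padding)
    feature_map = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Pointwise multiplication of the filter and the covered window, then sum.
            total = 0.0
            for a in range(k_rows):
                for b in range(k_cols):
                    total += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4 x 4 input and 2 x 2 filter.
image = [[1, 2, 3, 0],
         [0, 1, 2, 3],
         [3, 0, 1, 2],
         [2, 3, 0, 1]]
kernel = [[1, 0],
          [0, 1]]

print(conv2d(image, kernel))             # 3 x 3 output feature map
print(conv2d(image, kernel, padding=1))  # padding of 1 gives a 5 x 5 map
```

With a 3 × 3 filter, a stride of 1, and a padding of 1, `conv_output_size` returns the input size unchanged, which is the usual way to keep feature maps from shrinking.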

Activation Layers
Activation layers, nonlinear layers that usually follow the convolution layers, play an important role as a selection criterion that decides whether a selected neuron will fire. The input of the activation layer is a real number that is transformed by the application of a non-linear function. The activation layer is important because it allows the network to learn nonlinear mappings, which makes it able to model complex functions. The most common activation layers that are used in CNNs are sigmoid, Tanh, ReLU, LeakyReLU, and softmax. The activation layers can be classified into saturated activation layers and non-saturated activation layers. If the output of the activation layer ranges between finite boundaries, then it is classified as saturated; otherwise, if it tends to infinity, it is considered a non-saturated activation function. The non-saturated activation functions have many advantages compared to saturated activation layers. For instance, the non-saturated layers can significantly mitigate the exploding/vanishing gradient problem of the backpropagation algorithm [27], which is one of the main problems when training a CNN. Different activation functions are shown in Figure 5.

Sigmoid Function
The sigmoid function is a saturated activation layer. It is a form of the logistic function in which the input is a real number and the output is a number in the range of [0,1]. It can be defined by:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Tanh Activation Function
The hyperbolic tangent function is a saturated activation layer that is commonly used when a negative gradient is important. It outputs a number in the range of [−1, +1]. The following formula defines it:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

ReLU Activation Function
The rectified linear activation layer [28] is considered one of the most important activation layers in a CNN. It is a non-saturated activation function that is mainly used to remove any negative values. It is very useful in a CNN because it eliminates any negative gradients when the threshold is at zero. It can be defined by:

$$f(x) = \max(0, x)$$

LeakyReLU Activation Function
A leaky rectified linear activation layer [29] is a non-saturated activation function that allows some negative gradients to pass. It is used to reduce the effect of the negative gradients by a factor $\alpha$:

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$$

Softmax Activation Function
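Softmax is an activation layer that is usually at the end of a network, and it produces a discrete probability distribution vector:

$$P(y = i \mid X) = \frac{e^{X_i}}{\sum_{j=1}^{K} e^{X_j}}$$

where $X$ is the input vector with $K$ entries and $P(y = i \mid X)$ is the predicted probability of class $i$.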
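The five activation functions described above can be written as a minimal sketch that uses only the Python standard library. The default `alpha=0.01` for the leaky ReLU is an assumed illustrative value, not one given in the text, and the subtraction of the maximum inside `softmax` is a common numerical-stability device that does not change the result.

```python
import math


def sigmoid(x):
    """Saturated: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Saturated: squashes any real input into the range (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    """Non-saturated: negative values are set to zero."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Non-saturated: negative values are scaled by alpha (assumed default)."""
    return x if x > 0 else alpha * x


def softmax(xs):
    """Turn a vector of scores into a discrete probability distribution."""
    largest = max(xs)  # shifting by the maximum avoids overflow in exp
    exps = [math.exp(x - largest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), tanh(value), relu(value), leaky_relu(value))
print(softmax([1.0, 2.0, 3.0]))
```

The saturated functions stay within fixed bounds however large the input is, while `relu` and `leaky_relu` grow without bound for positive inputs.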

Pooling Layers
Pooling layers are usually placed between consecutive convolution layers to progressively reduce the spatial size of the representation and, in turn, the number of parameters and the amount of computation in a network. A pooling layer reduces the output feature map of the convolution layer by extracting important pixels and removing noise. In this work, we assumed that the measurements were not noisy, and, if this were not the case, a de-noising procedure would be necessary [30]. Additionally, a pooling layer is used to strengthen network spatial invariance [31]. The two main parameters of the pooling layers are the filter size and stride. The two main types of pooling layers are the maximum pooling layer and the average pooling layer.

Maximum Pooling
The pooling layer slides the filter over the output feature map of the previous convolution layer and keeps the maximum value of each grid:

$$y_{i,j} = \max_{(a,b) \in R_{i,j}} x_{a,b}$$

where $R_{i,j}$ is the grid of the input feature map $x$ that is covered by the filter at output position $(i, j)$.

Average Pooling
The pooling layer slides the filter over the output feature map of the previous convolution layer and takes the average of the grid.
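Both pooling variants can be sketched with one small function. The name `pool2d`, the default 2 × 2 filter with a stride of 2, and the toy feature map are illustrative assumptions.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Slide a size x size window over the map and keep its max or its mean."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, rows - size + 1, stride):
        pooled_row = []
        for j in range(0, cols - size + 1, stride):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            if mode == "max":
                pooled_row.append(max(window))
            else:
                pooled_row.append(sum(window) / len(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4 x 4 output feature map of a convolution layer.
feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]]

print(pool2d(feature_map, mode="max"))      # [[6, 4], [8, 9]]
print(pool2d(feature_map, mode="average"))  # [[3.75, 2.25], [4.5, 4.0]]
```

In both cases the 4 × 4 map is reduced to 2 × 2, which is the reduction in spatial size that the pooling layer is used for.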

Flattening Layers
The output of the pooling layer is flattened to a 1D vector because the subsequent dense layers can only receive 1D vectors. A flattening layer can be seen in Figure 6. The dimensionality of the resulting vector is given by:

$$d = h \times w \times c$$

where $h$ and $w$ are the height and width of each feature map and $c$ is the number of feature maps.

Figure 6. Flattening 2D feature maps to a 1D vector.

Dense Layers
Dense layers, also known as fully connected layers, are usually placed at the end of a network, and they receive as input the output of the feature extraction layers. The main purpose of the dense layer is to consider all the features that were extracted from the previous layers and to use them to classify the original image. At the end of the network, a softmax or sigmoid function is applied to output the target probability.

Dropout Layer
A dropout layer is a regularization layer that was first introduced by [32]. It can be applied to any layer in the network. During network training, some neurons are disabled with a predefined dropout-rate probability $p$. It can be thought of as bagging for neural networks.
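A minimal sketch of dropout on a list of activations is given below. It assumes the commonly used "inverted" variant, in which the surviving activations are rescaled by 1 / (1 − rate) during training so that no rescaling is needed at inference time; the text does not specify this detail. The function name `dropout` and its `rng` argument are illustrative.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Disable each neuron with probability `rate` while training.

    Assumption: inverted dropout, so kept activations are scaled by
    1 / (1 - rate). `rate` must be smaller than 1.
    """
    if not training or rate == 0.0:
        return list(activations)  # dropout is inactive at inference time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
print(dropout([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng))
print(dropout([1.0, 1.0, 1.0], rate=0.5, training=False))
```

Because a different random subset of neurons is disabled at every training step, each step effectively trains a different thinned network, which is why the technique is compared to bagging.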

Regularization Layers
Complex models that have large weights usually have a low generalizability since these models can learn noise instead of learning the true model patterns [33]. Under the assumption that models with small weights have a better generalizability than those with large weights, regularization functions are commonly used to limit overfitting. Regularization works by adding a penalty term to the loss function to prevent the model from using large weights [34]. The main idea of regularization is to eliminate the weights that do not contribute to the model accuracy by shrinking them to zero. Three types of regularization have been introduced in the literature: L1, L2, and elastic nets. The main differences between these regularizations lie in the penalty terms.

L1 Regularization
L1 regularization shrinks the weights toward zero by adding the sum of the absolute values of the weights to the loss function. It can push some weights to be exactly zero and so can be thought of as a feature selector. The magnitude of the penalty is determined by $\lambda$, so the larger the value of $\lambda$, the higher the constraint to the weights, usually $0 < \lambda < 1$. L1 regularization can be formally defined as:

$$\text{Cost} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(y_i, \hat{y}_i) + \lambda \sum_{j=1}^{M} |w_j| \qquad (9)$$

where $N$ is the number of training examples, $M$ denotes the number of weights, $w_j$ is the weight at the $j$-th neuron, $y_i$ is the label, $\hat{y}_i$ is the corresponding prediction, and $\lambda$ is the regularization factor.

L2 Regularization
L2 regularization decreases large weights by adding the sum of the squares of the weights to the loss function. The magnitude of the penalty is determined by $\lambda$, so the larger the value of $\lambda$, the higher the constraint to the weights, usually $0 < \lambda < 1$. L2 regularization can be formally defined as:

$$\text{Cost} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(y_i, \hat{y}_i) + \lambda \sum_{j=1}^{M} w_j^{2} \qquad (10)$$

where $N$ is the number of training examples, $M$ denotes the number of weights, $w_j$ is the weight at the $j$-th neuron, $y_i$ is the label, $\hat{y}_i$ is the corresponding prediction, and $\lambda$ is the regularization factor.

Elastic Net Regularization
To overcome the shortcomings of both techniques, elastic net was introduced, as it linearly combines both regularization techniques to benefit from both at once. Elastic net can be defined as:

$$\text{Cost} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(y_i, \hat{y}_i) + \lambda \left[ \alpha \sum_{j=1}^{M} |w_j| + (1 - \alpha) \sum_{j=1}^{M} w_j^{2} \right]$$

where $N$ is the number of training examples, $M$ denotes the number of weights, $w_j$ is the weight at the $j$-th neuron, $y_i$ is the label, $\hat{y}_i$ is the corresponding prediction, $\lambda$ is the regularization factor, and $\alpha$ is the mixing parameter between the ridge ($\alpha = 0$) and the lasso ($\alpha = 1$). By combining both L1 and L2, the strength of each term can be tuned by $\alpha$.
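The three penalty terms can be compared with a short sketch. It assumes the elastic net form in which the L1 and L2 sums are weighted by `alpha` and `1 - alpha` inside a common factor `lam` (some libraries additionally halve the L2 term); all function names and the example weights are illustrative.

```python
def l1_penalty(weights, lam):
    """lam times the sum of the absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam times the sum of the squared weights."""
    return lam * sum(w * w for w in weights)


def elastic_net_penalty(weights, lam, alpha):
    """Linear mix of both penalties: alpha = 1 is the lasso, alpha = 0 the ridge."""
    l1_sum = sum(abs(w) for w in weights)
    l2_sum = sum(w * w for w in weights)
    return lam * (alpha * l1_sum + (1.0 - alpha) * l2_sum)


def regularized_cost(per_example_losses, weights, lam, alpha):
    """Mean data loss over the training examples plus the elastic net penalty."""
    data_term = sum(per_example_losses) / len(per_example_losses)
    return data_term + elastic_net_penalty(weights, lam, alpha)


weights = [0.5, -1.0, 2.0]   # hypothetical model weights
losses = [0.2, 0.4, 0.6]     # hypothetical per-example losses

print(l1_penalty(weights, lam=0.1))
print(l2_penalty(weights, lam=0.1))
print(regularized_cost(losses, weights, lam=0.1, alpha=0.5))
```

The largest weight (2.0) contributes 2.0 to the L1 sum but 4.0 to the L2 sum, which shows why L2 acts most strongly on large weights while L1 applies the same pressure to every nonzero weight.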

Batch Normalization Layers
Batch normalization can speed up the training of the network and increase its robustness against overfitting [35]. It reduces the internal covariate shift of the network [36]. Additionally, batch normalization adds noise to each layer to increase its robustness. It works by normalizing the inputs of each layer it is applied to by subtracting the batch mean and dividing by the batch standard deviation:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$$

$$\sigma_B^{2} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^{2}$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^{2} + \epsilon}}$$

$$z_i = \gamma \hat{x}_i + \beta$$

where $\mu_B$ is the mini-batch mean, $\sigma_B$ is the mini-batch standard deviation, $m$ is the number of instances in the mini-batch, $\hat{x}_i$ is the zero-centered and normalized input for instance $i$, $\gamma$ is the scaling parameter for the layer, $\beta$ is the shifting parameter (offset) for the layer, $\epsilon$ is a tiny number to avoid division by zero (typically a small negative power of 10; it is called a smoothing term), and $z_i$ is the output of the operations (it is a scaled and shifted version of the inputs). Thus, in total, four parameters must be learned for each batch-normalized layer: $\gamma$, $\beta$, $\mu_B$, and $\sigma_B$.
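The normalization steps can be traced with a minimal sketch for a single feature over one mini-batch. The function name `batch_norm` and the default `eps=1e-5` are assumptions made for illustration (libraries use different defaults), and only the training-time computation is shown; the running statistics that are used at inference time are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    m = len(batch)
    mean = sum(batch) / m                                # mini-batch mean
    variance = sum((x - mean) ** 2 for x in batch) / m   # mini-batch variance
    # Zero-centered, normalized inputs; eps (assumed value) avoids division by zero.
    x_hat = [(x - mean) / math.sqrt(variance + eps) for x in batch]
    # gamma rescales and beta shifts the normalized values.
    return [gamma * xh + beta for xh in x_hat]


batch = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations of one neuron
print(batch_norm(batch))
print(batch_norm(batch, gamma=2.0, beta=3.0))
```

With `gamma=1` and `beta=0`, the output has a mean of approximately zero and a variance of approximately one; other values of `gamma` and `beta` let the layer restore whatever scale and offset suit the next layer.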

Transfer learning
Transfer learning is a deep learning technique that is used to rapidly and accurately train a CNN whose weights are not initialized from scratch. Instead, they are imported from another CNN that was trained on a larger dataset. The most popular set of weights used for transfer learning is from the ImageNet dataset [37]. Several CNN architectures have been trained on the ImageNet dataset and have achieved a high accuracy. These weights can be used to classify another completely different dataset instead of randomly initializing the weights from scratch. There are four strategies in transfer learning. The first strategy is to remove the original fully connected layers that act as classifiers, freeze the entire network weights, use the CNN pre-trained layers as feature extractors, and then add a classifier layer such as a fully connected layer or another machine learning classifier, like a support vector machine. The second strategy is to remove the original fully connected layers, fine-tune the entire network weights by using a very small learning rate (LR), and add a new classifier layer that suits the new task. The third strategy is to remove the fully connected layers, fine-tune only the top layers while keeping the bottom layers frozen, and then add a new classifier layer that suits the new task. Many researchers have suggested that the bottom layers only detect generic features such as edges and circles, while the top layers detect more dataset-specific features. For this reason, many authors recommend only fine-tuning the top layers [38][39][40]. The fourth strategy is to use a state-of-the-art architecture and train it from scratch, that is, to reuse only an architecture that has been proven to work on different challenging datasets. A generic CNN model architecture can be seen in Figure 7.
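The first strategy (freeze the pre-trained layers and train only a new classifier) can be illustrated with a deliberately tiny, framework-free sketch. Everything in it is a stand-in: `BASE_WEIGHTS` is a hand-picked matrix that plays the role of pre-trained weights (it is not taken from ImageNet), the "images" are two-dimensional points, and the new classifier is a logistic-regression head trained by stochastic gradient descent rather than a dense layer or a support vector machine.

```python
import math

# Stand-in for pre-trained weights; they are never updated below (frozen).
BASE_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]


def extract_features(x, base_weights):
    """Frozen feature extractor: a linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in base_weights]


def train_head(features, labels, lr=0.1, epochs=200):
    """Train only the new classifier head on top of the frozen features."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid output of the head
            grad = p - y                     # gradient of the log loss w.r.t. z
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b


def predict(x, base_weights, head):
    """Classify a new input with the frozen base and the trained head."""
    w, b = head
    f = extract_features(x, base_weights)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# Hypothetical target task: label 1 when the first coordinate is positive.
inputs = [[1.0, 0.5], [2.0, -1.0], [0.5, 1.0], [-1.0, 0.5], [-2.0, -1.0], [-0.5, 1.0]]
labels = [1, 1, 1, 0, 0, 0]

features = [extract_features(x, BASE_WEIGHTS) for x in inputs]
head = train_head(features, labels)
print([predict(x, BASE_WEIGHTS, head) for x in inputs])
```

The second and third strategies differ only in which weights receive gradient updates: all of them with a very small LR, or only those of the top layers.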

CNN Architectures
In this section, the main CNN architectures used in transfer learning are reviewed. According to [41], the rise of deep learning in image classification started in 2012 with the introduction of AlexNet [14], which introduced the ReLU activation layer as well. The usage of a CNN in image classification increased its accuracy and eliminated the need to feature-engineer each image. After AlexNet, many architectures, namely VGG16, VGG19, ResNet, GoogLeNet, DenseNet, and Xception, were introduced with more features to efficaciously classify images.

VGG Network Architecture
In 2014, researchers at Oxford's Visual Geometry Group introduced two novel architectures named VGG16 [42] and VGG19 [42]. VGG16 achieved a top five accuracy rate of 91.90% in the ImageNet competition in 2014. The VGG16 architecture has 138,357,544 parameters, five convolution blocks, and three dense layers. Each block contains some convolutional layers followed by a max pooling layer that decreases the block output size and removes noise. The first two blocks have two convolutional layers each, and the last three blocks have three convolutional layers each. The kernel that is used throughout this network has a size of 3 × 3 and a stride of 1. After the five blocks, a flatten layer converts the 3D output of the blocks to a 1D vector to be inserted into the fully connected layers. The first two fully connected layers have 4096 neurons, and the last fully connected layer has 1000 neurons. After the fully connected layers, a softmax layer is inserted, and this is used to ensure that the probability summation of the output is one. The main difference between VGG16 and VGG19 is that VGG19 has 19 weight layers (16 convolutional and three dense) instead of 16 (13 convolutional and three dense). The number of parameters increases from 138,357,544 to 143,667,240 because of the additional layers. The authors argued that these additional layers make the architecture more robust and able to learn more complex representations.
The main benefit of this network is its sequential blocks, where the sequential convolutional layers that are inserted after each other allow for a reduction of the amount of spatial information needed. The main drawback of this network is that the authors assign more weights to the classifier portion than to the feature extraction portion. This considerably increases the number of parameters. The network's ImageNet weights are available in the Keras package.
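The 138,357,544 parameters quoted for VGG16 can be reproduced with simple arithmetic, which also shows how heavily the count is dominated by the classifier portion. The sketch below relies on details of the published VGG16 configuration that are not spelled out in the text and are therefore assumptions here: a 224 × 224 RGB input, 3 × 3 kernels whose padding preserves the spatial size, 2 × 2 max pooling that halves it after every block, and block widths of 64, 128, 256, 512, and 512 filters.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights of a kernel x kernel convolution plus one bias per filter."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(n_in, n_out):
    """Weights of a fully connected layer plus one bias per neuron."""
    return n_in * n_out + n_out


# (number of convolutional layers, number of filters) for the five blocks.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg16_parameter_counts(input_channels=3, input_size=224, n_classes=1000):
    """Return (feature-extraction parameters, classifier parameters)."""
    channels, size, conv_total = input_channels, input_size, 0
    for n_layers, width in VGG16_BLOCKS:
        for _ in range(n_layers):
            conv_total += conv_params(channels, width)
            channels = width
        size //= 2  # the max pooling layer at the end of each block halves the size
    flat = size * size * channels  # length of the flattened 1D vector
    dense_total = (dense_params(flat, 4096)
                   + dense_params(4096, 4096)
                   + dense_params(4096, n_classes))
    return conv_total, dense_total


conv_total, dense_total = vgg16_parameter_counts()
print(conv_total, dense_total, conv_total + dense_total)
```

Under these assumptions, the 13 convolutional layers hold about 14.7 million parameters, while the three dense layers hold about 123.6 million, most of them in the first dense layer that receives the 25,088-element flattened vector.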

ResNet Network Architecture
ResNet, which stands for residual network, was introduced by He et al. [43] in 2015 and achieved first place in the 2015 ImageNet competition with a top five accuracy rate of 94.29%. It has a total of 25,000,000 parameters. Compared to other architectures, ResNet is a very deep network that can reach up to 152 layers, and it has a unique connection called the residual connection, which skips over a group of convolutional layers and adds their input to their output before the result is passed to the ReLU activation layer. The residual connection makes sure that during backpropagation, the gradients that reach the previous layers do not vanish. Three versions (which differ in the number of layers) of this network have been introduced, namely ResNet50, ResNet101, and ResNet152. The main benefit of this network is the use of residual connections, which makes it possible to use a large number of layers. Moreover, increasing the depth of the network (instead of widening it) results in fewer extra parameters. One drawback of this network is the summation in each residual block, which forces the feature maps that are added together to have the same size. Additionally, this network requires large datasets to be properly trained, thus resulting in a computationally expensive training phase. The network's ImageNet weights are available in the Keras package.
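The residual connection can be sketched in a few lines. For brevity, the example uses small fully connected layers as a stand-in for the convolutional layers of a real residual block, and the names `linear`, `relu_vec`, and `residual_block` as well as the weight matrices are illustrative assumptions.

```python
def relu_vec(values):
    """Apply ReLU to every element of a vector."""
    return [max(0.0, v) for v in values]


def linear(values, weights):
    """Matrix-vector product; a stand-in for a convolutional layer."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """Compute ReLU(F(x) + x), where F(x) = W2 * ReLU(W1 * x)."""
    fx = linear(relu_vec(linear(x, w1)), w2)
    # The summation requires F(x) and x to have the same length.
    return relu_vec([a + b for a, b in zip(fx, x)])


x = [1.0, 2.0]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
some_weights = [[0.5, -0.5], [0.25, 0.25]]

print(residual_block(x, zero_weights, zero_weights))  # the input passes through
print(residual_block(x, some_weights, some_weights))
```

Even when the inner layers contribute nothing (all-zero weights), the input still reaches the output through the shortcut, which is the property that lets very deep stacks of such blocks be trained.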

GoogLeNet Network Architecture
In 2014, Google researchers introduced a novel architecture called the GoogLeNet network [44], which is also known as the InceptionV1 architecture. The authors won the ImageNet competition [45] with a top 5 accuracy rate of 92.2%. After the success of InceptionV1, the authors introduced other versions like InceptionV2 and InceptionV3. The main idea of the GoogLeNet architecture is to use multiple convolution layers in the same block to go not only deeper but wider and to capture different features of the images; these blocks are referred to as Inception blocks. The most popular GoogLeNet architectures are the InceptionV1 and InceptionV3 architectures. In the InceptionV1 inception blocks, six convolution layers are used, while in the InceptionV3 inception blocks, seven convolution layers are used. In the remainder of the paper, just like in the literature, the InceptionV1 architecture is referred to as the GoogLeNet architecture. The main benefit of this network is the presence of an inception module, which allows the network to capture different aspect ratios of the same image by using the convolution layers in parallel. The main drawback of this network is the computational effort that is needed to train it because the layers are deep and wide. InceptionV3's ImageNet weights are available in the Keras package. An InceptionV1 block and InceptionV3 block are shown in Figure 8.

AlexNet Network Architecture
The AlexNet architecture [14] was the first CNN to participate in the ImageNet challenge in 2012. It achieved an accuracy rate of 86%, which outperformed all the previous shallow algorithms used in image classification. Since then, CNNs have become the state-of-the-art algorithm in image classification. The AlexNet architecture has 60,000,000 parameters, five convolution layers, and three dense layers. The two novel contributions of AlexNet were the usage of the ReLU activation function (instead of the sigmoid activation function) and the usage of dropout to overcome the overfitting that can be caused by this deep architecture. The main advantage of this network is that its training process is computationally efficient compared with the other networks that have been taken into account. On the other hand, the network is not deep enough to capture complicated features from images.

DenseNet Network Architecture
DenseNet [45] stands for densely connected convolutional networks. It was inspired by ResNet, but instead of the residual connections, the authors proposed the use of dense blocks. The dense block consists of sequentially placed convolution layers, like VGG, but each layer has a connection to all the subsequent layers. The main idea is for each convolution layer to receive the information from all the previous layers. DenseNet has 8,062,504 parameters and achieved a 93.34% top 5 accuracy rate on the ILSVRC challenge. The main advantage of this network is the presence of connections between all layers, which reduces the information loss between layers (especially the deep layers). The main drawbacks are the following: The training phase is computationally expensive, and it requires very large datasets to achieve satisfactory performance. The network's ImageNet weights are available in the Keras package.

Xception Network Architecture
The Xception (which stands for extreme inception) network was introduced by Chollet [46], and it was inspired by the InceptionV3 architecture. The main idea that is exploited by the Xception architecture is to replace the inception module with depthwise separable convolutions, that is, a depthwise convolution followed by a pointwise convolution. This network is 71 layers deep, and it has 22.9 million parameters. The Xception network achieved a 94.50% top 5 accuracy rate on the ILSVRC challenge. The main advantage of this network is that it has a deep architecture but with a small number of parameters, thus making it computationally efficient compared to other deep networks. The main drawback is that this network requires very large datasets to be able to train all its parameters.
Table 1 shows a summary of the reviewed networks with their number of parameters and their accuracy over the ImageNet dataset [37]. The accuracy is calculated by dividing the number of correctly classified observations by the total number of observations. The top-$N$ accuracy is the accuracy of the architecture over its $N$ highest-ranked predicted labels, where the top 5 accuracy considers five predicted classes and the top 1 accuracy considers a single predicted class. When $N = 5$, a prediction is counted as correct if the true label is present in the top 5 predicted labels, while if $N = 1$, the top-$N$ accuracy is the de facto accuracy measure. The top 5 accuracy measure was used here because the ILSVRC challenge had 1000 classes.
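The difference between the top 1 and top 5 accuracy measures can be made concrete with a minimal sketch. The function name `top_n_accuracy` and the three-class toy predictions are illustrative assumptions; with only three classes, the example uses the top 2 accuracy in place of the top 5.

```python
def top_n_accuracy(probabilities, labels, n=5):
    """Fraction of observations whose true label is among the n most probable classes."""
    hits = 0
    for probs, label in zip(probabilities, labels):
        # Class indices sorted from the most to the least probable.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if label in ranked[:n]:
            hits += 1
    return hits / len(labels)


# Hypothetical predicted probabilities for three observations and three classes.
probabilities = [[0.1, 0.6, 0.3],
                 [0.5, 0.2, 0.3],
                 [0.2, 0.3, 0.5]]
labels = [1, 2, 0]

print(top_n_accuracy(probabilities, labels, n=1))  # only the first one is correct
print(top_n_accuracy(probabilities, labels, n=2))  # the second one is now also counted
```

The second observation is misclassified under the top 1 measure but counts as correct under the top 2 measure because its true class has the second-highest probability. With 1000 classes, many of them visually similar, this tolerance is the reason the top 5 measure is reported.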

DR Datasets
Several DR datasets were made publicly available to allow researchers to develop algorithms that are able to classify DR. A brief description of these datasets is given in this section. Descriptions of the DR datasets are shown in Table 2.

Kaggle Dataset
The Kaggle DR [47] dataset is considered one of the most important datasets for DR because it includes more than 88,000 publicly available images that were captured by using different cameras at different angles and dimensions. Therefore, different levels of quality appear in this dataset. This dataset is divided into 40% for training and 60% for testing. The annotation of this dataset is a five-class annotation, as proposed by Wilkinson et al. [10]. The dataset suffers from imbalance, as the rare DR levels (3 and 4) cover less than 5% of the dataset.

Messidor Dataset
Messidor [48,49] is a publicly available dataset that consists of 1200 DR images. This dataset, like the Kaggle dataset, was acquired by using different cameras and settings, and it was built by collecting images from three different hospitals in France. This dataset is more balanced than Kaggle's because each class is distributed uniformly. The DR grades are divided into four grades.

DR1 Dataset
DR1 [50] is a publicly available dataset that was provided by the Federal University of Sao Paulo, Brazil. The dataset contains 1014 images with 68% normal images and 32% DR images. All the images were captured by using the same camera.

E-ophtha Dataset
The E-ophtha dataset [51] is a publicly available dataset that contains two main subsets of images. The E-ophtha_Ex subset has the objective of detecting exudates in fundus images. It has 82 images split into 47 fundus images with exudates and 35 images without exudates. The other subset is E-ophtha_MA, and its objective is to detect microaneurysms in fundus images. It contains 381 images divided into 148 images with microaneurysms and 233 without.

STARE Dataset
The STARE dataset [52] is a publicly available dataset that contains 400 images that were captured by using the same camera. It has 397 fundus images divided into 14 retina-related diseases.

Paper review
This section discusses the selected papers based on different aspects: the architecture used, the target dataset used, the optimizer used, the LR used, the performance of the architecture after transfer learning, the fine-tuning process employed, and, finally, the validation process whenever applicable.
In transfer learning, a set of weights that were learned from an image dataset can be used to classify another image dataset. The deep layers are generic and can be used to extract salient features that are suitable for classifying any image. This aspect is why many authors have tried to use transfer learning in detecting DR. For instance, Gulshan et al. [53] used the InceptionV3 architecture to classify DR into two grades: DR or No DR. The dataset that was considered by the authors contained 128,175 images. The reported results on two test datasets, with sizes 9963 and 1748, had sensitivities of 97.5% and 96.1%, respectively. Masood et al. [54] used the Kaggle dataset to assess the performance of the InceptionV3 model to classify DR into five grades. The authors chose 4000 images and cropped them to 500 pixels. The authors used accuracy to assess the model's performance, which was reported as 48.8%.
Li et al. [55] discussed the usage of transfer learning for detecting DR by comparing different network architectures, including AlexNet, VGG-S, VGG16, and VGG19, on two datasets: the Messidor and DR1 datasets. Three transfer learning techniques were analyzed: fine-tuning the entire network, fine-tuning the network layer-wise, and, finally, freezing the weights of the entire network and applying an SVM as the classification layer. The authors used stochastic gradient descent as the optimizer, and the images were pre-classified as either DR or No DR to pose a binary classification problem. The performance measure used was the AUC of the ROC curve. The highest AUC was obtained by fine-tuning the entire network, while the second-best performance was achieved by fine-tuning layer-wise. The VGG-S architecture obtained the highest AUC for the Messidor dataset, 98.34%. For the DR1 dataset, an AUC of 97.86% was obtained by using the same network.
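Since several of the reviewed studies report the AUC of the ROC curve, the following minimal sketch shows one way to compute it from binary labels and predicted scores, using the rank-based formulation (the probability that a randomly chosen DR image receives a higher score than a randomly chosen No DR image). The function name roc_auc and the toy labels and scores are illustrative assumptions, not data from any reviewed paper.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    labels: 1 for DR, 0 for No DR (illustrative encoding).
    scores: predicted probability of DR for each image.
    A tie between a positive and a negative counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for six images (three DR, three No DR).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(toy_labels, toy_scores)
```

The pairwise loop is quadratic in the number of images, which is acceptable for a sketch; a sorting-based implementation would be preferable for datasets the size of the Kaggle set.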
Mohammadian et al. [56] compared the InceptionV3 and Xception architectures to classify DR into two grades, DR or No DR, by using the Kaggle dataset. The authors used the whole dataset of 35,126 images, with 20% of the images being used to test the algorithm's performance on unseen data. The authors fine-tuned the last two blocks of the two architectures and compared two optimizers, stochastic gradient descent and Adam, with different LRs. The authors augmented the images by flipping them horizontally and vertically, or by shifting and rotating them, to increase the robustness of the model. The authors used accuracy to assess the performance of the architectures. The reported results were 87.12% for the InceptionV3 architecture and 74.49% for Xception.
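The flipping and shifting augmentations mentioned above can be illustrated on a tiny grayscale image stored as a list of pixel rows. This is only a sketch under that simplifying assumption; the helper names and the 2 × 3 toy image are hypothetical, and real pipelines would operate on full-resolution colour fundus images.

```python
def flip_horizontal(image):
    """Mirror each row (left-right flip). `image` is a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def flip_vertical(image):
    """Reverse the row order (top-bottom flip)."""
    return [list(row) for row in reversed(image)]


def shift_right(image, pixels, fill=0):
    """Shift every row right by `pixels`, padding the left edge with `fill`."""
    width = len(image[0])
    pixels = min(pixels, width)
    return [[fill] * pixels + list(row[: width - pixels]) for row in image]


# Hypothetical 2 x 3 image; each augmented copy is an extra training sample.
toy = [[1, 2, 3],
       [4, 5, 6]]
augmented = [flip_horizontal(toy), flip_vertical(toy), shift_right(toy, 1)]
```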
Takahashi et al. [57] trained a modified GoogLeNet architecture by using a private dataset. They used 9443 images to train the model and 496 to test it. They cropped the images to 1272 × 1272 pixels, and they considered a four-class classification scheme. The reported accuracy was 81%, and the kappa score was equal to 0.74. Choi et al. [58] investigated the impact of transfer learning on the STARE dataset [52]. They used image augmentation techniques to increase the size of the dataset to 10,000 images, with ten retina disorder categories, including DR. The authors opted for the pretrained VGG19 and AlexNet architectures. An ensemble was created to increase the network accuracy, and k-fold validation with k = 5 was used to validate the results. The highest accuracy that was obtained by the authors was achieved by using the VGG19 architecture with random forest (RF) as a classifier.
Wang et al. [59] investigated transfer learning techniques by using three network architectures: AlexNet, VGG16, and InceptionV3. The authors used 166 images from the Kaggle dataset to tune the algorithms. The authors opted for the five-stage classification approach instead of the binary classification approach that has been used by other authors for this specific dataset. Additionally, they employed a stochastic gradient descent optimizer with Nesterov momentum to accelerate the convergence to the minimum. The authors cropped the images for each architecture to 227 × 227 for AlexNet, 224 × 224 for VGG16, and 299 × 299 for InceptionV3. They used the accuracy of the network as the evaluation metric, and they used k-fold cross-validation with k = 5 to validate the results. The best-reported accuracy was 63.2% for the InceptionV3 architecture. Hazim et al. [60] used 580 images from the Messidor dataset to test the transfer learning of AlexNet. They opted for a two-class classification, and they cropped the images to 227 × 227. They achieved an 88.3% accuracy on the test set, which consisted of 290 images.
Lam et al. [61] considered the sliding window algorithm, where small patches from the original images are used to train the CNN. These patches contain the important features of each image, such as the presence of exudates or microaneurysms. The authors used the Kaggle dataset to extract these patches. They extracted 1324 patches from 243 images and split these patches into training and testing datasets. They tested the proposed algorithm by using the E-ophtha dataset, which contained 195 images. They used the GoogLeNet architecture to train the model with an input size of 128 × 128. The authors considered a multi-class classification task with five DR grades. They resized the test images to 2048 × 2048 and normalized the pixels to test the model. Subsequently, the trained model was slid over the test image to produce a heat map with a probability score for every one of the five grades. The authors compared five pre-trained architectures (AlexNet, VGG16, GoogLeNet, ResNet, and InceptionV3) for binary classification and multi-class classification. The best performing architecture was InceptionV3 with a multi-class accuracy of 96% and a binary-class accuracy of 98%.
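The sliding window idea described above can be sketched as follows: a fixed-size window moves over the image with a given stride, and each patch is scored. In this illustration the trained CNN is replaced by a hypothetical stand-in scorer (mean_intensity) that returns a single number per patch rather than a probability for each of the five grades; the function names, patch size, and toy image are all assumptions made for the example.

```python
def sliding_window_heatmap(image, patch_size, stride, score_patch):
    """Score every patch of a 2-D image and return a grid of scores."""
    height, width = len(image), len(image[0])
    heatmap = []
    for top in range(0, height - patch_size + 1, stride):
        row_scores = []
        for left in range(0, width - patch_size + 1, stride):
            # Cut out the patch whose top-left corner is (top, left).
            patch = [r[left:left + patch_size]
                     for r in image[top:top + patch_size]]
            row_scores.append(score_patch(patch))
        heatmap.append(row_scores)
    return heatmap


def mean_intensity(patch):
    """Stand-in for the trained CNN: the mean pixel value of the patch."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


# Hypothetical 4 x 4 image with a bright "lesion" in the lower-right corner.
toy_image = [[0, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 8, 8],
             [0, 0, 8, 8]]
heatmap = sliding_window_heatmap(toy_image, patch_size=2, stride=2,
                                 score_patch=mean_intensity)
```

With a stride smaller than the patch size, the windows overlap and the resulting heat map becomes denser, at the cost of more evaluations of the scoring function.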
Lam et al. [62] trained a CNN by using transfer learning of the AlexNet, VGG16, and GoogLeNet models on the two-class version of the Kaggle dataset. The authors reported that GoogLeNet achieved the highest sensitivity of 95% and specificity of 96%. The authors also tried to utilize the multi-class Kaggle dataset, but they stated that the CNN was unable to learn to detect the mild class. The authors achieved decent results for detecting mild grades when using the Messidor dataset. Wan et al. [63] compared transfer learning with learning from scratch. The authors used four CNN architectures, namely AlexNet, ResNet, GoogLeNet, and VGG. The authors performed their experiments on the full Kaggle dataset, and they used the AUC of the ROC curve, accuracy, sensitivity, and specificity as evaluation criteria. The authors reported that transfer learning significantly increased the performance of the CNNs, with VGG-S producing the highest AUC.
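Sensitivity and specificity, which are reported by several of the studies above, can be computed from the counts of a binary confusion matrix. The sketch below assumes the label encoding 1 = DR and 0 = No DR; the function name and the ten toy predictions are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = DR, 0 = No DR)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # DR correctly detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # DR missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy flagged as DR
    sensitivity = tp / (tp + fn)  # fraction of DR images that were detected
    specificity = tn / (tn + fp)  # fraction of No DR images that were cleared
    return sensitivity, specificity


# Hypothetical ground truth and predictions for ten images.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(toy_true, toy_pred)
```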
Xu et al. [64] studied the difference between the performance of DenseNet with and without fine-tuning. The authors evaluated their method on a private dataset with 10,000 images and five grades. The authors used image augmentation to increase the size of the dataset and to balance the dataset between the different classes. The final dataset contained 20,000 images that were distributed equally among the five classes. The authors used stochastic gradient descent (SGD) as the optimizer with an LR of 0.1 for training from scratch and an LR of 0.01 for fine-tuning the network. The authors reported that transfer learning increased the accuracy of the model used.
Tsighe et al. [65] investigated the usage of the InceptionV3 architecture to detect DR in the Kaggle dataset. The authors chose 2500 images and cropped them to 300 × 300 to train the model, and 5000 images were used to test the model. The authors pre-classified the images as either DR or No DR to make it a binary classification task. They employed stochastic gradient descent as the optimizer, with an LR of 0.0005, to fine-tune the neural network. The reported result was a 90.9% accuracy and a 3.94% loss. Chen et al. [66] considered the pre-trained InceptionV3 architecture to classify DR on 7023 images of the Kaggle dataset. The authors adopted a five-stage classification approach with the quadratic weighted kappa as the performance measure. The images were cropped to 229 × 229, and stochastic gradient descent was used as the optimizer. Image augmentation was used, together with early stopping (15 iterations), to limit the overfitting of the network. The reported kappa score was 0.64, with an accuracy of 80%.
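The quadratic weighted kappa mentioned above penalizes a disagreement between the true and the predicted grade by the squared distance between the two grades, so that confusing grade 0 with grade 4 costs more than confusing grade 3 with grade 4. The following is a minimal sketch of the standard definition; the function name and the toy grades are illustrative assumptions, and the function is undefined (division by zero) when all labels and predictions fall in one single class.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic weighted kappa for integer grades in 0..num_classes-1."""
    n = len(y_true)
    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    numerator = 0.0
    denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            # Count expected by chance if grades and predictions were independent.
            expected = true_hist[i] * pred_hist[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator


# Hypothetical grades for five images; only the last one is off by one grade.
toy_true = [0, 1, 2, 3, 4]
toy_pred = [0, 1, 2, 3, 3]
kappa = quadratic_weighted_kappa(toy_true, toy_pred, num_classes=5)
```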
Zeng et al. [67] proposed a novel Siamese-like architecture in which left and right fundus images were classified together. Siamese neural networks consist of two parallel neural networks, each of which takes a different input. The authors used the Kaggle dataset with 28,104 training images split between right and left eyes and 7024 images to test the architecture. They used the InceptionV3 network pretrained on the ImageNet dataset. The authors examined the five-stage classification, as proposed by Wilkinson et al. [10], as well as a binary classification. They used Adam as the optimizer, the quadratic weighted kappa as the performance measure for the multiclass classification, and the AUC of the ROC curve for the binary classification. All the layers of InceptionV3 were fine-tuned, and the images were cropped to 229 × 229. The authors augmented the images by randomly flipping them horizontally and by randomly applying a geometric transformation to increase the dataset's size and to control overfitting. They normalized all images from [0,255] to [−1,+1]. They reported a kappa of 0.829 for the multiclass classification and an AUC of 95.1% for the binary classification.
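The normalization from [0,255] to [−1,+1] mentioned above is a linear rescaling. One common way to write it is shown below; the function name is hypothetical, and the source does not state which exact formula the authors used.

```python
def to_signed_unit_range(pixel_value):
    """Map a pixel value from [0, 255] linearly onto [-1, +1]."""
    return pixel_value / 127.5 - 1.0


# Apply the rescaling to a hypothetical row of pixel values.
toy_row = [0, 51, 255]
normalized_row = [to_signed_unit_range(v) for v in toy_row]
```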
Zhang et al. [68] used a private dataset with 13,767 images to propose a model called DeepDR, which uses deep learning based on transfer learning models to detect DR. The model consists of three stages: identification, grading, and reporting. The identification stage is a binary classification model that predicts the presence of DR. If DR is present, then the image is passed to the grading stage, which assigns one of the four DR grades; the last stage reports the result of the model. The authors used InceptionV3, Xception, and InceptionResNetV2 for feature extraction in the identification system. Moreover, they added a global average pooling layer to normalize the output of the feature extractor, and they subsequently added four dense layers with sizes 1024, 512, 256, and 128, respectively. A dropout layer between the dense layers, with a probability of 50%, was employed to limit overfitting. Due to its speed of convergence, the authors opted for the LeakyReLU activation function with a negative-slope coefficient (α) of 0.2 and, at the end, a softmax layer so that the output probabilities sum to 100%. For the grading system, the authors used ResNet50, DenseNet169, and DenseNet201 for feature extraction. They then added a global average pooling layer and four dense layers with sizes 2048, 1024, 512, and 256, respectively. They employed a dropout layer between the dense layers with a probability of 50%, LeakyReLU with α of 0.2 as the activation function for all the dense layers, and, at the end, a softmax layer. The authors averaged the outputs of the softmax layers of the three models to decrease the variance of the model output. The identification model achieved a sensitivity of 97.5% and a specificity of 97.7%, while the grading model reached 98.1% for sensitivity and 98.9% for specificity.
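Three of the building blocks described above (the LeakyReLU activation with α = 0.2, the softmax layer, and the averaging of the softmax outputs of three models) can be sketched in a few lines. The logits below are hypothetical values for a single image and four grades; the sketch does not reproduce the DeepDR networks themselves.

```python
import math


def leaky_relu(x, alpha=0.2):
    """LeakyReLU: identity for positive inputs, slope `alpha` for negative ones."""
    return x if x > 0 else alpha * x


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_ensemble(prob_lists):
    """Average the class probabilities predicted by several models."""
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]


# Hypothetical logits from three models for one image and four grades.
model_logits = [[2.0, 0.5, 0.1, -1.0],
                [1.5, 1.0, 0.0, -0.5],
                [2.2, 0.2, 0.3, -1.2]]
model_probs = [softmax(logits) for logits in model_logits]
ensemble_probs = average_ensemble(model_probs)
predicted_grade = max(range(len(ensemble_probs)), key=ensemble_probs.__getitem__)
```

Because each softmax output sums to 1, the averaged vector also sums to 1 and can be read as a probability distribution over the grades.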
Yip et al. [69] explored three CNN architectures, namely VGG, ResNet, and an ensemble of both architectures. The authors experimented with a private dataset with three classes of DR and 148,266 images, divided into 51.5% to train and 48.5% to validate the model. Three measures were used to assess the quality of the model, namely AUC, sensitivity, and specificity. The authors reported that transfer learning increased model accuracy. Gao et al. [70] used a private dataset with 4476 images and four classes. The authors cut the original images into four 300 × 300 partitions that were the inputs of four InceptionV3 networks, and then they concatenated the results into a single layer. The original fully connected layers were removed, and only a softmax layer was used. The Adam optimizer was employed to fine-tune the InceptionV3 networks. The authors compared their method against ResNet18, ResNet101, VGG19, and InceptionV3. The reported results showed that their model achieved a higher accuracy than the other models.
Table 3 shows the list of the reviewed papers that applied transfer learning to classify DR. The results shown from Li et al. [55] are the results of fine-tuning the entire networks by using the Messidor dataset. The results from Zhang et al. [68] are the results of the grading model. The results shown from Lam et al. [62] are the results of the GoogLeNet architecture for the two-class Kaggle dataset. The results from Yip et al. [69] are the results for vision-threatening DR. The results shown from Xu et al. [64] are the results of using transfer learning with 24 kernels. The VGG19 results shown from Choi et al. [58] are those of VGG19 with transfer learning and RF as a classifier.

Discussion
This study reviewed recent studies that implemented transfer learning in classifying diabetic retinopathy images. These studies were extracted from two databases (PubMed and Scopus), and, after applying two filters, 18 studies were selected. The selected papers were analyzed based on six aspects: the architecture used, the target dataset used, the optimizer and LR used, the performance of the architecture after transfer learning, the fine-tuning process used, and, finally, the validation process that was applied. In this section, we discuss the main findings of this analysis.

Architectures used
In the reviewed articles, many state-of-the-art architectures were used to classify DR. Among them, InceptionV3 was the most commonly used, followed by the AlexNet and VGG16 architectures. The choice of the architecture did not depend on the size of the dataset. In studies [55,56,58,59,61–63,68–70], the authors compared different architectures to determine the best performing one. In studies [56,59,61,63,70], the authors compared the InceptionV3 architecture to other networks, and InceptionV3 achieved the best performance in all the studies except for [63]. The lowest performance was achieved by the AlexNet architecture in the following studies: [58,59,61–63]. The high performance of InceptionV3 may be attributed to the inception module used. This module can capture features at different scales in the same image, which was shown to be very useful in DR images. The low performance of AlexNet could have been caused by the fact that it only uses five convolution layers, which may not be sufficient to accurately classify challenging images, like DR images. A summary of the architectures used is shown in Table 4.

The datasets used
In the reviewed papers, the most commonly used public dataset was the Kaggle dataset, due to its availability and its size, followed by the Messidor dataset. Many private datasets were used as well, in studies [53,57,64,68,69]. Many researchers, like Tsighe et al. [65], Li et al. [55], Mohammadian et al. [56], Hazim et al. [60], Lam et al. [61], and Lam et al. [62], considered a binary classification task due to the lack of a sufficient number of images for some of the classes. In particular, the lack of images of severe cases plays an important role because too few such images are available for training the network. An important factor that affected the performance of the classifier was the size of the dataset, which plays a significant role in classification performance, especially when using a CNN. The second important factor was the number of classes in each dataset, with binary classification outperforming multiclass classification. This can be attributed to the imbalance of the datasets and to the difficulty (for some of the models used) in distinguishing among more than two classes. This difficulty was caused by the low number of examples of a given class, as well as by the quality of the images. A summary of the datasets used is shown in Table 5.

The optimizers used
The main task of the optimizer during network training is to update the weights to reduce the value of the loss function. The optimizer can have a huge impact on the convergence of the training process, especially for transfer learning, as pointed out by Mohammadian et al. [56] and Lam et al. [62]. Four optimizers were mainly reported by the authors, namely SGD for studies [55,52–65], SGD with momentum for studies [56,58,59], Adam for studies [56,67,70], and RMSProp in [68]. The stochastic gradient descent (SGD) optimizer allows for a faster training process than traditional gradient descent because it only considers a subset of the training set at each iteration. Thus, it generally achieves faster iterations in exchange for a (slightly) lower convergence rate. SGD with momentum (SGDM) can be used instead of SGD: by determining the next update of the weights as a linear combination of the gradient and the previous update, it dampens oscillations in the training process. This should result in faster and more accurate convergence. The RMSProp optimizer, a member of the adaptive gradient group, was introduced to overcome the problem of choosing the initial value of the learning rate, which is instead adapted during the training process. The Adam optimizer was introduced to combine the benefits of both the SGDM and the RMSProp optimizers.
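To make the differences between the four update rules concrete, the sketch below applies each of them to a single weight w with the toy loss (w − 3)², whose minimum is at w = 3. This is an illustration under simplifying assumptions: the gradient is computed exactly rather than estimated from a mini-batch, and the hyperparameter values (momentum 0.9, decay 0.9, β1 = 0.9, β2 = 0.999, and the learning rates in the usage lines) are common defaults chosen for the example, not values taken from the reviewed papers.

```python
import math


def grad(w):
    """Gradient of the toy loss (w - 3)^2; a mini-batch estimate in real SGD."""
    return 2.0 * (w - 3.0)


def run_sgd(w, lr, steps):
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def run_sgdm(w, lr, steps, momentum=0.9):
    velocity = 0.0
    for _ in range(steps):
        # New update = linear combination of previous update and gradient.
        velocity = momentum * velocity - lr * grad(w)
        w += velocity
    return w


def run_rmsprop(w, lr, steps, decay=0.9, eps=1e-8):
    sq_avg = 0.0
    for _ in range(steps):
        g = grad(w)
        # Running average of squared gradients rescales the step size.
        sq_avg = decay * sq_avg + (1.0 - decay) * g * g
        w -= lr * g / (math.sqrt(sq_avg) + eps)
    return w


def run_adam(w, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    m = 0.0  # running average of gradients (momentum part)
    v = 0.0  # running average of squared gradients (RMSProp part)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# All four runs start from w = 0 and should end close to the minimum at 3.
w_sgd = run_sgd(0.0, lr=0.1, steps=2000)
w_sgdm = run_sgdm(0.0, lr=0.1, steps=2000)
w_rmsprop = run_rmsprop(0.0, lr=0.01, steps=2000)
w_adam = run_adam(0.0, lr=0.05, steps=2000)
```

With a constant learning rate, RMSProp is only expected to settle in a small neighbourhood of the minimum (on the order of the learning rate), which is why it is usually combined with a decaying learning rate in practice.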
The LR chosen by the authors was very low to avoid losing the original weights of the layers, and it ranged from 1 10 to 1 10 . Wang et al. [59]  The choice of the optimizer and the learning rate can play a vital role in network performance and convergence time, especially when using transfer learning. Optimizers like SGD and SGDM can take a longer time to reach convergence, while the RMSProp optimizer can take a shorter time but might not reach the same performance as SGD and SGDM. The Adam optimizer can reach the performance of SGD and SGDM while taking a shorter time, like RMSProp. The learning rate is very important as well because a high learning rate can completely change the pretrained weights, thus deteriorating the performance of the network. On the other hand, with a low learning rate, the network weights are adjusted to the new dataset without completely changing the original weights. A summary of the optimizers that were found in the reviewed papers is shown in Table 6.

The performance difference by applying transfer learning
The suitability of transfer learning for DR image classification can only be assessed by comparing an architecture that was trained from scratch to its fine-tuned version. Masood et al. [54] reported that the network accuracy increased from 37.6% to 48.8% by using transfer learning on the InceptionV3 architecture that was trained on the Kaggle dataset. Wan et al. [63] confirmed the effect of transfer learning on six state-of-the-art architectures by using the full-size Kaggle dataset. The authors reported that the accuracy increased significantly by using transfer learning, and they also observed that using transfer learning significantly decreased overfitting. Xu et al. [64] reported that the accuracy of the DenseNet architecture significantly increased by using transfer learning with a private dataset.
From the results obtained by the previously mentioned studies, we can conclude that transfer learning can provide a significant contribution to the classification of DR. The DR images are very challenging to classify, and, usually, the DR datasets only have a limited number of images. For this reason, the use of transfer learning is particularly suitable for achieving high accuracies instead of training the networks from scratch.

The fine-tuning technique
Fine-tuning the entire network was the most commonly used method for transfer learning in the reviewed papers. Some novel approaches were introduced, like the Siamese network presented by Zeng et al. [67], where two networks were used in parallel. Li et al. [55] compared three different transfer learning techniques, namely fine-tuning the entire network, fine-tuning the network layer-wise, and feature extraction. The highest AUC was achieved by fine-tuning the entire network. Zeng et al. [67] reported that they fine-tuned the entire InceptionV3 to suit the Kaggle dataset. Mohammadian et al. [56] compared fine-tuning the last two layers against fine-tuning the last four layers and feature extraction. They confirmed that fine-tuning the last two layers achieved the highest performance for InceptionV3. Lam et al. [62] froze the weights of the AlexNet and GoogLeNet architectures and employed feature extraction. Some authors did not report the method that they used to fine-tune their architecture, while others stated that they fine-tuned the network without explicitly stating how.
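The strategies discussed above differ only in which pretrained layers are allowed to change during training. A minimal way to picture this is a list of boolean flags, one per pretrained layer, as in the sketch below. The function name, the strategy labels, and the five-layer example are hypothetical; in the feature-extraction case, a new classifier trained on top of the frozen layers is assumed but not shown.

```python
def trainable_flags(num_layers, strategy, last_n=0):
    """Return one flag per pretrained layer: True if its weights are updated."""
    if strategy == "full":
        # Full fine-tuning: every pretrained layer is updated.
        return [True] * num_layers
    if strategy == "feature_extraction":
        # All pretrained layers are frozen; only a new classifier is trained.
        return [False] * num_layers
    if strategy == "last_n":
        # Only the last `last_n` layers are updated; earlier ones stay frozen.
        if not 0 <= last_n <= num_layers:
            raise ValueError("last_n must be between 0 and num_layers")
        return [False] * (num_layers - last_n) + [True] * last_n
    raise ValueError("unknown strategy: " + strategy)


# Hypothetical five-layer network under the three strategies.
full = trainable_flags(5, "full")
frozen = trainable_flags(5, "feature_extraction")
top_two = trainable_flags(5, "last_n", last_n=2)
```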

Performance validation
Two main methods are commonly used to validate model performance, namely k-fold cross-validation and splitting the dataset into training and test sets. Depending on the size of the dataset, some authors opted to use the train/test split method (usually with an 80%/20% split), while other authors used k-fold cross-validation, especially if the target dataset was small. Wang et al. [59] used k-fold cross-validation with k = 5 to validate their results, taking into account that they only had 166 images in their dataset.
Li et al. [55] used k-fold cross-validation with k = 5, and the sizes of the datasets used were 1200 and 1014. Zeng et al. [67] and Mohammadian et al. [56] used 20% of their dataset, the full-size Kaggle dataset with 35,128 images, to validate their results. Lam et al. [61] validated their results by using different test datasets.
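The k-fold procedure with k = 5 can be sketched as follows for a dataset of 166 images, the size reported for Wang et al. [59]. The image indices are shuffled once and dealt into five folds; each fold serves as the test set exactly once while the other four are used for training. The function name and the fixed random seed are assumptions made for reproducibility of the example, and the sketch does not stratify by DR grade.

```python
import random


def k_fold_indices(num_samples, k=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


# Five train/test splits over 166 image indices.
splits = k_fold_indices(166, k=5)
```

The reported score is then usually the mean of the metric over the five test folds, which makes the estimate less dependent on a single lucky or unlucky split.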

Open questions
In this section, we discuss various challenges that the researchers have not addressed in the previous literature about using transfer learning for DR classification. Further research is needed to improve the performance of the networks and to explore other powerful techniques. Some challenges that deserve further investigation are listed below.

The effect of layer-wise fine-tuning instead of full fine-tuning on DR image classification
One of the main questions when applying transfer learning to DR is how deep to fine-tune the network, taking into consideration the size of the DR dataset and the architecture used. This question still needs further study to understand the effect of each layer on the network's performance and to determine how deep to fine-tune a CNN. Full fine-tuning can be very computationally expensive, as it requires a lot of time, and it is not guaranteed to converge better than top-layer fine-tuning.

The effect of the optimizer used and the learning rate used in DR image classification
For DR datasets, the optimizer that is used can have a huge impact on the performance of the network and the time needed for convergence. The choice of the initial LR is still a very debatable area, especially in fine-tuning. Two questions remain to be answered: does the best LR depend on the size of the DR dataset? Do we need different LRs for full fine-tuning, for top-layer fine-tuning, and for feature extraction?

The effect of the batch size used in DR image classification
The impact of batch size on the fine-tuning process still needs to be investigated in detail because this can have a huge impact on the network's performance. Additionally, its relationship with the size of the DR dataset and the architecture used deserves further analysis.

The effect of choosing another dataset than ImageNet
ImageNet is the de facto source dataset when it comes to transfer learning because it contains millions of images with thousands of classes. What would the effect be if ImageNet were substituted with another large dataset to perform transfer learning for DR datasets? Currently, there is no medical image dataset that can play the same role as the ImageNet dataset. Thus, an effort in the medical community would be fundamental to build a vast dataset that can be used to train different architectures that are designed to address the DR classification task.

The effect of image augmentation
Is image augmentation needed in DR classification? Geometric transformations of DR images, like rotation, can distort them and mask important features that the algorithm can use to output the predicted grade. Additionally, the usage of image augmentation together with transfer learning for DR needs further investigation because image augmentation was mainly introduced to mitigate the effect of small datasets, and transfer learning is used for the same reason.

Conclusion
The computer-assisted detection of medical images is a recently emerging application of artificial intelligence that can save time, money, and manpower. The main challenge of using CNNs in medical image classification is the size of the training dataset, which is typically limited since an experienced doctor is required to annotate each image and, sometimes, even to resort to a second opinion to classify some difficult images. Transfer learning can be a viable option considering its suitability when only a limited number of training observations is available to address the image classification task. Thus, transfer learning can play an important role in the medical field. Complex and deep architectures are being developed to solve tasks related to computer vision, and these architectures can be successfully applied to solve challenges in the field of medical images. This paper reviewed CNN-based techniques for classifying DR images. Though many novel architectures have been proposed to solve DR classification, the current paper only focused on transfer learning-based methods and how transfer learning can be applied to classify DR images.

Conflicts of Interest:
The authors declare no conflict of interest.