Blended Multi-Modal Deep ConvNet Features for Diabetic Retinopathy Severity Prediction

Diabetic Retinopathy (DR) is one of the major causes of visual impairment and blindness across the world. It is usually found in patients who have suffered from diabetes for a long period. The major focus of this work is to derive an optimal representation of retinal images that helps to improve the performance of DR recognition models. To extract this representation, features extracted from multiple pre-trained ConvNet models are blended using the proposed multi-modal fusion module. These final representations are used to train a Deep Neural Network (DNN) for DR identification and severity level prediction. As each ConvNet extracts different features, fusing them using 1D pooling and cross pooling leads to a better representation than using features extracted from a single ConvNet. Experimental studies on the benchmark Kaggle APTOS 2019 contest dataset reveal that the model trained on the proposed blended feature representations is superior to the existing methods. In addition, we notice that cross average pooling based fusion of features from Xception and VGG16 is the most appropriate for DR recognition. With the proposed model, we achieve an accuracy of 97.41% and a kappa statistic of 94.82 for DR identification, and an accuracy of 81.7% and a kappa statistic of 71.1% for severity level prediction. Another interesting observation is that a DNN with dropout at the input layer converges more quickly when trained using blended features than the same model trained using uni-modal deep features.


Introduction
Diabetic Retinopathy (DR) is an adverse effect of Diabetes Mellitus (DM) [1] that can lead to permanent blindness. It is usually caused by damage to the blood vessels that nourish the light-sensitive tissue called the retina. As per statistics [2], DR is the fifth leading cause of blindness across the globe. According to the World Health Organization (WHO), by 2013, around 382 million people were suffering from DR, and this number may rise to 592 million by 2025. It is possible to save many people from going blind if DR is identified in the early stages. Small lesions are formed in the eyes of DR-affected people, and the type of lesions formed decides the level of severity of DR. Figure 1a shows the types of lesions. DR can be categorised into five different stages [3]: No DR (Class-0), Mild DR (Class-1), Moderate DR (Class-2), Severe DR (Class-3) and Proliferative DR (Class-4). Sample retinal images with different severity levels of DR are shown in Figure 1b. Mild DR is the early stage, during which the formation of Micro Aneurysms (MA) can be observed. As the disease progresses to the Moderate stage, swelling of blood vessels can be found, which leads to blurred vision. During the later Non-Proliferative DR (NPDR) stage, abnormal growth of blood vessels can be noticed. This stage is severe due to the blockage of a large number of blood vessels. Proliferative DR (PDR) is the advanced stage of DR; during this stage, retinal detachment along with a large retinal break can be observed, which leads to complete vision loss [4].
In traditional DR diagnosis approaches, manual grading of the retinal scan is required to identify the presence or absence of retinopathy. If DR is confirmed as positive, further diagnosis is recommended to identify the severity level of the disease. This kind of diagnosis is quite expensive and time consuming, as it demands human expertise. If DR identification is automated, diagnosis of the disease becomes affordable to many people. In the recent past, several machine learning tools have been introduced to address this problem.
Early approaches to DR identification, where the presence or absence of DR is revealed, focused on spotting Hard Exudates (HEs). A dynamic threshold based Support Vector Machine (SVM) is used to segment HEs in the retinal images [5]. Fuzzy C-means is used to detect HEs, and an SVM is used to identify the severity level of the disease to make the system more sophisticated [6]. SVM based classifiers are adapted to find cotton wool spots in the retinal images.
With the introduction of deep learning, the focus of researchers has shifted from spotting HEs to spotting MAs. A two-step CNN is introduced to segment MAs in the given retinal scans [7]. Another CNN architecture, trained using a selective sampling approach, is proposed to detect hemorrhages [8]. A max-out activation is introduced to improve the performance of a DNN model, with MA detection in DR as the application [9]. Recently, a bounding box based approach is introduced to identify the region of interest in the retinal images [10]. Though a good number of methods are available in the literature, they are either sub-optimal or complex. Hence there is a need for a solution that is simple and robust.
The objective of this work is to design a simple and robust deep learning-based approach to recognize DR from the given retinal images. The major focus of this work is to obtain a better feature representation of the retinal images, which ultimately leads to a better model. To accomplish this, we propose uni-modal and multi-modal approaches. Initially, for the given retinal images, deep features are extracted from different pre-trained ConvNets such as VGG16, NASNet, Xception and Inception ResNetV2. In the uni-modal approach, features extracted from a single pre-trained ConvNet give the final feature representation. In the multi-modal approach, our idea is to blend the deep features extracted from multiple ConvNets to get the final feature representation. We propose different pooling based approaches to blend multiple deep features. To check the efficiency of our feature representation, a Deep Neural Network (DNN) architecture is proposed for identification of DR (task1) and for recognition of the severity level of DR (task2). We observe that in the multi-modal approach, blending deep features from Xception and Inception ResNet V2 outperforms others in both the tasks. Another interesting observation is that there is a drop in the number of false positives, which is most desirable. Experimental studies on the benchmark APTOS 2019 dataset reveal that a DNN model trained on our blended feature representations gives superior performance compared to the existing methods.
Following are the major contributions of the proposed work:
• Effectiveness of the uni-modal feature representation is verified.
• A blended multi-modal feature representation approach is introduced.
• Different pool based approaches are proposed to blend deep features.
• A DNN architecture with dropout at the input layer is proposed to test the efficiency of the proposed uni-modal and blended multi-modal feature representations.
• APTOS 2019 benchmark dataset is used to compare the performance of the proposed approach with existing models.

Related Work
In the recent past, machine learning models have become very popular for solving various problems such as image classification [11], text processing [12], real-time fault diagnosis [13] and healthcare [14,15]. It is very common to use ML algorithms to address disease prediction [16,17,18].
In this section we report various conventional models available in the literature for the task of DR recognition. In [19], an easy to remember scientific approach has been introduced for DR severity identification. In [20], the authors presented a hybrid classifier that uses both GMM and SVM as an ensemble model to improve accuracy. The same approach has been modified by augmenting the feature set with shape, intensity, and statistics of the affected region [21]. A random forest-based approach is proposed in [22,23] and segmentation based approaches are proposed in [24]. In [25], a genetic algorithm-based feature extraction method is introduced. Different shallow classifiers such as GMM, KNN, SVM, and AdaBoost are analysed in [26] to differentiate lesions from non-lesions. A hybrid feature extraction based approach is used in [27].
In the next few lines, deep learning models available in the literature for the task of DR severity identification are introduced. A large dataset consisting of 128,175 retinal images is used to train a deep CNN. In [28], a data augmentation method is used to generate training data for a CNN architecture. Fuzzy models are used in [29], where a hybrid model based on fuzzy logic, the Hough Transform and numerous extraction methods is implemented. A combination of fuzzy C-means and deep CNN architectures is used in [30]. A Siamese Convolutional Neural Network is used in [31] to detect diabetic retinopathy.
With the introduction of deep learning models, the focus has shifted to deep feature based models. In [32], Muhammad Mateen used features extracted from different layers of pre-trained ConvNets like VGG19 and further applied PCA and SVD on those features for dimension reduction [33] to avoid over-fitting. In the case of the former models, the model is not robust, and in the latter case, the models are robust, but large datasets are needed to train them. A PCA based fire-fly model [34] along with a deep neural network is used for DR detection [35]; the UCI repository is used for the experiments.
The performance of any ML algorithm is subject to the features extracted from the given data. Conventional ML models need a separate algorithm (such as GIST, HOG or SIFT) for feature extraction, which gives a global or local representation of the images. Features extracted in this process are known as hand-crafted features. Until the arrival of deep learning models, these hand-crafted features were dominant and widely used.

Deep ConvNets for feature extraction and transfer learning
Deep learning models [36][37][38] learn the essential characteristics of the input images. This exceptional capability makes them representation models, as they can represent the data efficiently and reduce the need for an additional feature extraction phase where features are hand-crafted. Deeper layers of CNN models can represent the entire given input more efficiently than the early layers.
The downside of deep learning models is that they need enormous amounts of data for training, which is usually scarce for most real-time applications. This problem can be addressed by transfer learning, where the knowledge gained by a deep learning model is transferred to other models. To achieve this, deep pre-trained CNN models like VGG16 and ResNet152 are available for transfer learning. Pre-trained models are models that are trained on large amounts of data, and the weights learned during that training can be applied to similar kinds of tasks.
There are different types of pre-trained models, which are trained on large scale datasets such as ImageNet that consists of more than a million images. Popular pre-trained deep CNN models include VGG16, VGG19, ResNet152, InceptionV3, Xception, NASNet, Inception ResNet V2 and DarkNet. The four used in this work are briefly described below:
• Visual Geometry Group (VGG16): VGG16 is a deep ConvNet trained on 14 million images belonging to 1000 different classes, and it topped the leader board in the ILSVR (ImageNet) challenge. In this architecture, 3×3 filters with stride 1 are used for the convolution operation, and 2×2 filters with stride 2 and same padding are used for the max-pooling operation across the network. At the end of the architecture, two fully connected dense layers of 4096 neurons each are followed by a soft-max layer.
• Neural Architecture Search Network (NASNet): This is a special kind of deep CNN which searches for a better architectural building block on small datasets like CIFAR10 and transfers it to larger datasets like ImageNet. It has a better regularisation mechanism called scheduled drop path, which significantly improves generalisation.
• Xception: Xception is another deep ConvNet architecture that supports depth-wise separable convolution operations and outperformed ResNet and InceptionV3 in the ILSVR challenge.
• Inception ResNetV2: This is popularly known as InceptionV4, as it combines two different architectures, InceptionV3 and ResNet152. It has both inception and residual connections, which boost the performance of the model.
Deep neural networks give excellent performance only when trained with extensive data. If the data used for training is not sufficient, DNN models tend to overfit. Deep Convolutional Neural Networks are introduced in [39] for the task of scalable image recognition. Xception, a deep CNN, is developed using depthwise separable convolutions to improve performance [40]. A flexible architecture has been defined in [41], which can search for a better convolutional cell with a better regularisation mechanism. All these models are trained on the ImageNet dataset for the ILSVR challenge.
Our objective is to create a robust and efficient model to recognise DR with limited datasets and limited computational resources. To achieve this, we rely on transfer learning and use various pre-trained ConvNets to extract deep features. We use the knowledge of these models to extract the most prominent features of colour fundus images. A deep neural network with dropout introduced at the input layer is trained to detect DR and to classify its severity level. Because dropout is applied at the input layer, the deep neural network is less prone to over-fitting.

Proposed Methodology
In this work, our objective is to develop a robust and efficient model to automate DR diagnosis. We focus on the extraction of deep features that are most descriptive and discriminative, which ultimately improves the performance of DR recognition. In order to get an optimal representation, features are extracted from multiple pre-trained CNN architectures and are blended using pooling based approaches. These final representations are used to train a Deep Neural Network with dropout at the input layer. The proposed model has three different modules: feature extraction, model training, and evaluation.

Feature Extraction
Performance of any machine learning model is highly influenced by the feature representations and the same is applicable to models used for DR recognition. With this motivation, we propose two different approaches (uni-modal and multi-modal) to extract optimal features from the given retinal images.
In the proposed work, initial representations of the retinal images are obtained from the pre-trained VGG16, NASNet, Xception and Inception ResNetV2 models. As each pre-trained model expects input images of a different size, the given retinal images are resized according to the input dimensions accepted by each model; for example, when VGG16 is used, images are resized to 224×224×3. These resized retinal images are fed to the pre-trained models after removing the soft-max layer and freezing the rest of the layers. Activation outputs from the penultimate layers form the basis for the proposed feature extraction module. For each retinal image, deep features are extracted from the pre-trained ConvNets as follows:
• Each of the first (fc1) and second (fc2) fully connected layers of VGG16 produces a feature vector of 4096 dimensions.
• The final global average pooling layers of NASNet, Xception and InceptionResNetV2 give feature vectors of size 4032, 2048 and 1536 respectively.
Figure 2 gives the architectural details of the pre-trained VGG16, NASNet, Xception and InceptionResNetV2 models, with pointers marked at the feature extraction layers. These features form the input to the proposed uni-modal and blended multi-modal approaches to obtain the optimal feature representations of the retinal images.

Uni-modal deep feature extraction:
In this approach, deep features are extracted from the final layers of one of the pre-trained ConvNets (VGG16, NASNet, Xception, Inception ResNetV2) to get the global representation of the retinal images. These deep features are fed to classification models for DR identification and severity recognition. We propose to use a DNN architecture with dropout at the input layer for DR identification and classification. Figure 3 gives the details of the different stages involved in the DR recognition process that uses uni-modal deep ConvNet features.

In the blended multi-modal approach, we propose various pooling approaches to fuse the deep features extracted from multiple pre-trained ConvNets. The final blended deep features provide a more descriptive and discriminative representation of the retinal images. These blended features are fed to the classification models for DR identification or severity recognition. Figure 4 gives the details of the different stages involved in the DR recognition process that uses blended multi-modal deep ConvNet features. The proposed blended multi-modal feature extraction module uses features from both the fully connected layers of VGG16 (fc1 and fc2) and the global average pooling layer of Xception as input. The rationale behind choosing features from VGG16 and Xception over others is twofold. In VGG16, each feature map of the final convolution block learns the presence of different lesions from the retinal images. Xception Net learns correlations across the 2-D space; as a result, each feature map provides a comprehensive representation of the entire retinal scan. Figure 5 visualizes the feature maps obtained from the final convolution blocks of the VGG16 and Xception models when a retinal image is passed to these models.

1-D pooling based fusion takes one feature vector $U$ as input and produces another feature vector $\hat{U}$, where $U \in \mathbb{R}^{d_1}$, $\hat{U} \in \mathbb{R}^{d_2}$ and $d_2 \le d_1$. $\hat{U}$ is a reduced representation of $U$, where $U = \{u_1, u_2, \ldots, u_{d_1}\}$ and $\hat{U} = \{\hat{u}_1, \hat{u}_2, \ldots, \hat{u}_{d_2}\}$. Each feature element $\hat{u}_i$ of the output vector $\hat{U}$ is computed using one of the following approaches:
1-D Max pooling: $\hat{u}_i = \max(u_{2i}, u_{2i+1}); \ \forall i \in \{1, 2, \ldots, d_2\}$
1-D Min pooling: $\hat{u}_i = \min(u_{2i}, u_{2i+1}); \ \forall i \in \{1, 2, \ldots, d_2\}$
1-D Average pooling: $\hat{u}_i = \mathrm{mean}(u_{2i}, u_{2i+1}); \ \forall i \in \{1, 2, \ldots, d_2\}$
1-D Sum pooling: $\hat{u}_i = u_{2i} + u_{2i+1}; \ \forall i \in \{1, 2, \ldots, d_2\}$

In cross pooling based feature fusion, two different feature vectors $X$ and $Y$ are passed as input, and another feature vector $Z$ is produced, where $X, Y, Z \in \mathbb{R}^{d}$. Each feature element $z_i$ of the output vector $Z$ is computed element-wise from $x_i$ and $y_i$ using one of the following three approaches:
Cross Min pooling: $z_i = \min(x_i, y_i); \ \forall i \in \{1, 2, \ldots, d\}$
Cross Average pooling: $z_i = \mathrm{mean}(x_i, y_i); \ \forall i \in \{1, 2, \ldots, d\}$
Cross Sum pooling: $z_i = x_i + y_i; \ \forall i \in \{1, 2, \ldots, d\}$

1-D pooling is applied independently on the features extracted from the fc1 and fc2 layers of VGG16. Then the cross pooling approach is applied on the resultant pooled features. This feature vector is merged with the features extracted from Xception using cross pooling. The fusion module produces deep blended features, which are used to train the proposed DNN model. Figure 6 shows the proposed architecture of the deep feature fusion approach used to blend features from different ConvNets. As the final feature vector is a blended version of the local and global representations of the retinal images, it provides strong features. Algorithm 1 gives the sequence of steps involved in the blended multi-modal feature fusion based DR recognition.
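The fusion steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `pool_1d`, `cross_pool` and `blend` are hypothetical, zero-based indexing is assumed so that consecutive non-overlapping pairs (0,1), (2,3), ... are pooled, and the same operator is assumed at every stage of the fusion.

```python
def pool_1d(u, op):
    """Reduce a vector by applying `op` to consecutive, non-overlapping pairs."""
    if len(u) % 2 != 0:
        raise ValueError("this sketch assumes an even-length vector")
    return [op(u[2 * i], u[2 * i + 1]) for i in range(len(u) // 2)]


def cross_pool(x, y, op):
    """Fuse two equal-length vectors element by element with `op`."""
    if len(x) != len(y):
        raise ValueError("cross pooling needs vectors of the same length")
    return [op(a, b) for a, b in zip(x, y)]


# Pair-wise operators for the pooling variants named in the text.
OPS = {
    "max": max,
    "min": min,
    "avg": lambda a, b: (a + b) / 2.0,
    "sum": lambda a, b: a + b,
}


def blend(fc1, fc2, xcep, mode="avg"):
    """Blend VGG16 fc1/fc2 features (4096-d) with Xception features (2048-d).

    Assumption: the same operator is used for the 1-D and the cross pooling.
    """
    op = OPS[mode]
    # 1-D pooling halves each VGG16 vector, cross pooling merges the two halves.
    vgg = cross_pool(pool_1d(fc1, op), pool_1d(fc2, op), op)
    # The merged VGG16 vector now matches the Xception vector in length.
    return cross_pool(vgg, xcep, op)


# Toy example: 4-d stand-ins for fc1/fc2 and a 2-d stand-in for Xception.
fused = blend([1.0, 3.0, 5.0, 7.0], [2.0, 4.0, 6.0, 8.0], [10.0, 20.0], "avg")
print(fused)  # [6.25, 13.25]
```

With real features, the 4096-dimensional fc1 and fc2 vectors are each halved to 2048 dimensions, which is what allows them to be cross pooled with the 2048-dimensional Xception vector.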

Model Training and Evaluation:
During this phase, we train the ML model with the deep blended pre-trained features. We prefer to use a Deep Neural Network (DNN) model for training. For the DR identification task, as it is a simple binary classification task, a DNN with two hidden layers of 256 and 128 units respectively, with ReLU activation, is used.
For the DR severity classification task, a DNN with three hidden layers of 512, 256 and 128 units respectively, with ReLU activation, is used. For both DNNs, we apply a dropout of 0.2 at the input layer to keep the model from over-fitting. This helps the model to become robust. Figure 7 represents the architecture of the proposed approach for model training and evaluation.
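To make the layer layout concrete, the following sketch runs a forward pass through the severity network (hidden layers of 512, 256 and 128 ReLU units, dropout of 0.2 on the input). Several details are assumptions that the text does not state: the soft-max output over five classes, the inverted-dropout rescaling, the random weight initialisation, and the 64-dimensional demo input used to keep the example fast. All function names are hypothetical.

```python
import math
import random


def dense(x, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def input_dropout(x, rate, rng):
    """Inverted dropout (assumed): zero inputs with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def init_layers(sizes, rng):
    """Random weights for illustration only; a trained model would supply these."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        w = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((w, [0.0] * n_out))
    return layers


def forward(x, layers, rng=None, rate=0.2):
    """Dropout is applied to the input only when `rng` is given (training mode)."""
    if rng is not None:
        x = input_dropout(x, rate, rng)
    for w, b in layers[:-1]:
        x = relu(dense(x, w, b))
    w, b = layers[-1]
    return softmax(dense(x, w, b))


rng = random.Random(0)
# Severity model: 512, 256, 128 hidden units and 5 output classes.
layers = init_layers([64, 512, 256, 128, 5], rng)
probs = forward([rng.random() for _ in range(64)], layers, rng=rng)
print(len(probs), round(sum(probs), 6))
```

Under the same assumptions, the identification model would use the layer sizes 256 and 128 with a single sigmoid output instead of the five-way soft-max.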

Experimental Results
In this section, we provide details of the experimental studies carried out to understand the efficiency of the proposed blended multi-modal deep feature representation.

Dataset Summary
For the experimental studies, the APTOS 2019 Kaggle benchmark dataset, available as part of the blindness detection challenge, is used [42]. This is a large dataset of retinal images taken using fundus photography under a variety of imaging conditions. The images are graded manually on a scale of 0 to 4 (0 - No DR, 1 - Mild, 2 - Moderate, 3 - Severe, 4 - Proliferative DR) to indicate different severity levels.

Table 1
Severity Level                   # Samples
Class-4 (Proliferative Stage)    295
Total                            3662

Table 1 gives the number of retinal images available in the dataset under each level of severity. We can observe that the dataset is imbalanced, with a larger number of normal images and very few images in Class-3. In all the experiments, 80% of the data is used for training and the remaining 20% is used for validation.
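As a small illustration of the 80/20 protocol, the sketch below splits the 3662 image indices into training and validation sets. The text does not say how the split was drawn, so the random shuffle, the seed and the helper name `train_val_split` are all assumptions.

```python
import random


def train_val_split(n_samples, val_fraction=0.2, seed=42):
    """Shuffle sample indices and hold out `val_fraction` of them for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: a plain random split
    n_val = round(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = train_val_split(3662)
print(len(train_idx), len(val_idx))  # 2930 732
```

Given the class imbalance noted above, a split that preserves the per-class proportions would be a reasonable alternative.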

Performance Measures:
For the evaluation of the proposed model, we report different measures: Accuracy, Precision, Recall, and F1 Score. In addition, we use the Kappa statistic, which compares an observed accuracy with an expected accuracy. The Kappa statistic is calculated as

$\kappa = \dfrac{\text{observed accuracy} - \text{expected accuracy}}{1 - \text{expected accuracy}}$

Observed accuracy is defined as the fraction of samples that are correctly classified. Expected accuracy is defined as the accuracy that a classifier would be expected to achieve by chance; it is determined by the number of examples of each class together with the number of examples predicted as that class.
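The Kappa statistic can be computed directly from a confusion matrix, as in the sketch below. It assumes the unweighted form of Cohen's kappa, which matches the observed-versus-expected description above; the helper name `cohen_kappa` and the example matrix are hypothetical and not taken from the reported results.

```python
def cohen_kappa(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed accuracy: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # Expected accuracy: chance agreement from the row and column totals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Hypothetical 2x2 confusion matrix for a binary task.
example = [[45, 5],
           [10, 40]]
print(round(cohen_kappa(example), 4))  # 0.7
```

Here the observed accuracy is 0.85 and the expected accuracy is 0.5, so the statistic is 0.35 / 0.5 = 0.7.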

DR Identification and Severity level Prediction:
The experiments carried out in this work are divided into two different tasks. In task1, the presence or absence of DR is identified, whereas in task2, the severity level is predicted for the given retinal image.

Task1 -DR Identification:
In this task, given the retinal image of a diabetic patient, we need to check whether the person is affected by retinopathy or not. DR identification is a binary classification task, so binary cross entropy is used as the loss, and the Adam optimiser is used to optimise the objective function. The dataset contains images belonging to 5 different classes, as shown in Table 1, and is not directly suitable for a binary classification task. Merging all the DR-affected images into a single class gives 1857 positively labeled images, and the remaining 1805 normal images are labeled as negative.
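The label merging and the loss used for task1 can be illustrated as follows. This is a sketch under stated assumptions: the helper names are hypothetical, the grades and predicted probabilities are made-up toy values, and the loss is the usual mean binary cross entropy with clipping to avoid taking the logarithm of zero.

```python
import math


def to_binary(severity_labels):
    """Map severity grades 0-4 to 0 (No DR) or 1 (any DR)."""
    return [1 if s > 0 else 0 for s in severity_labels]


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


grades = [0, 2, 4, 0, 1, 3]              # hypothetical severity grades
labels = to_binary(grades)               # [0, 1, 1, 0, 1, 1]
probs = [0.1, 0.8, 0.9, 0.3, 0.6, 0.7]   # hypothetical model outputs
print(labels, round(binary_cross_entropy(labels, probs), 4))
```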

Task2 -Severity level Prediction:
The objective of task1 is to identify the presence or absence of DR, given a retinal image. While treating DR-affected patients, mere identification of DR is not sufficient, and understanding the level of severity helps to provide better treatment. Hence we treat severity level identification as a separate task that assigns the given retinal image to one of the 5 severity levels. Categorical cross entropy is used as the loss, and the Adam optimiser is used to optimise the objective function.
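For the five-class task, the loss compares a one-hot target with the predicted class probabilities. The sketch below is illustrative only; the helper names and the example predictions are hypothetical.

```python
import math


def one_hot(label, n_classes=5):
    """Encode a severity grade as a one-hot target vector."""
    return [1.0 if k == label else 0.0 for k in range(n_classes)]


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean over samples of -sum_k t_k * log(p_k)."""
    total = 0.0
    for target, probs in zip(y_true, y_prob):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))
    return total / len(y_true)


targets = [one_hot(2), one_hot(0)]       # Moderate DR, No DR
preds = [[0.1, 0.1, 0.6, 0.1, 0.1],      # hypothetical soft-max outputs
         [0.7, 0.1, 0.1, 0.05, 0.05]]
print(round(categorical_cross_entropy(targets, preds), 4))
```

Only the probability assigned to the true class contributes to each sample's loss, so a uniform prediction over five classes costs log(5) per sample.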

Experimental studies to show the representative nature of uni-modal features for task1
This experiment is carried out to understand how efficiently retinal images are represented using uni-modal features that are directly obtained from a single pre-trained ConvNet. VGG16, Xception, NASNet, and Inception ResNetV2 are considered for extracting uni-modal features. For classification, the Naïve Bayes classifier, logistic regression, decision tree, k-Nearest Neighbour (KNN) classifier, Multi Layered Perceptron (MLP), Support Vector Machine (SVM) and Deep Neural Network (DNN) are used. Tables 2 and 3 show the performance of the DR identification task using different ML models when the retinal images are represented with the features extracted from the fc2 fully connected layer of VGG16 and from Xception, respectively. From these results we conclude that the DNN outperforms the rest of the ML models irrespective of the pre-trained model used for feature extraction. Hence we decided to use the DNN model alone in the rest of the experiments.
Table 4 shows the representative power of uni-modal features that are extracted from different pre-trained models. It is clear from the results that the performance of the DNN model varies depending on the uni-modal features used. This experiment gives a clue that each pre-trained model extracts a different set of features from retinal images. The features extracted from Xception yield better performance in terms of accuracy for the diabetic retinopathy identification task. A nominal difference in terms of accuracy and kappa score can be observed between the models trained using different uni-modal features. For a better understanding of the representative nature of different uni-modal features, the loss and the number of epochs taken to converge by the DNN models are reported in Table 5. We can observe that the model trained using VGG16-fc1 reaches the minimum loss compared to the rest of the models. In terms of convergence, Xception takes only 16 epochs, whereas Inception ResNetV2 outperforms the other models in terms of performance.
To summarize the experiments on the DR identification task, features extracted from Xception, VGG16-fc2 and Inception ResNetV2 yield the same accuracy with nominal differences. However, models trained on the VGG16-fc1 features give better kappa scores compared to the others. We can also observe that models trained on the VGG16-fc2 features give better performance in terms of precision, recall and F1 scores. Regardless of the type of uni-modal features used, the DNN consistently outperforms the rest of the models, especially in terms of kappa scores. The reason for the superior performance of the models trained using VGG16 and Xception features is that these models are good at extracting the lesion information that is useful to discriminate the DR-affected images from those that are not affected.

Experimental studies to show the representative nature of uni-modal features for task2
We run a set of experiments to understand the nature of uni-modal features for severity prediction of DR. Task2 is more challenging than task1, as it involves multiple classes. The DNN model with dropout at the input layer is used with different uni-modal features. Based on the results reported in Table 6, we can observe the same trend that has been observed in task1. The scores obtained for task2 show the complexity of severity prediction. The model trained on VGG16-fc1 features shows superior performance compared to the rest of the models. The same can be observed in terms of all the metrics. From Table 7, it is clear that among all the pre-trained features, VGG16-fc1 yields superior performance with minimum loss. However, Xception converges in fewer epochs compared to the other models.

Performance evaluation of the proposed blended multi-modal features
A clue from the experiments on uni-modal features is that different pre-trained models extract different sets of features from the retinal images. If we use multiple deep features extracted from different models, they complement each other and help to improve the scores. To benefit from more than one set of uni-modal features, we propose a blended multi-modal feature representation. This section is dedicated to showing the representative power of the proposed feature representation with an application to DR identification and severity level prediction.
In addition we apply the proposed pooling methods to blend the features from multiple pre-trained models. Initially we blend features from first and second fully connected layers of VGG16. Then we extend this to fusion of 3 different features from fc1, fc2 layers of VGG16 and Xception.

Blended Multi-Modal deep features for task1
We study the effect of blending deep features extracted from multiple pre-trained models on the DR identification task. In addition, we verify the proposed maximum, sum and average pooling approaches for blending multiple deep features. From Table 8, we can observe that average pooling based fusion works better for DR detection compared to the other approaches. Using average fusion, the models trained on multi-modal features achieve superior performance in terms of accuracy and kappa statistic. In addition, the model converges faster, in less than 50 epochs, and attains minimum loss. The accuracy obtained by the model trained using multi-modal features is significantly better than that of the models trained on uni-modal features.

Blended Multi-Modal deep features for task2
From the previous experiments we understand that the models trained on multi-modal features give better performance compared to those trained on uni-modal features in the context of DR identification, which is a simple binary task. To verify that the proposed blended representation performs efficiently for the more complex multi-class classification task, we apply the proposed feature representation to the severity prediction task. From Table 10, we can see that average pooling based fusion of multiple deep features works better for DR severity prediction. Compared to the blended features from VGG16-fc1 and VGG16-fc2, the blended features from VGG16-fc1, VGG16-fc2 and Xception give a better representation. For severity prediction also, the model that uses the average pooling approach for fusion converges faster, with better accuracy and kappa score, when compared with the other fusion approaches.

Comparison of proposed Blended feature extraction with existing methods
In this experiment we show the effectiveness of the proposed DNN with dropout at the input layer, trained using the proposed blended multi-modal deep feature representation, in comparison with the existing models in the literature for DR prediction. We compare the proposed model with the models used in [43] and [44]. From Table 10 we can see that the proposed method gives an accuracy of 80.96%, which is significantly better than the existing models in the literature. When compared to the existing models, the proposed DNN model is simple, with only 3 hidden layers of 512, 256, and 128 units. The confusion matrix in Figure 8 shows the mis-classifications produced by the proposed model when applied to the DR severity prediction task. From the figure we can see that most of the Proliferative DR images are predicted as Moderate.
As the final feature vector is a blended version of the local and global representations of the retinal images, the final representation provides strong features. The reason for the improvement in the performance of the proposed model is that each feature map of the final convolution block of VGG16 learns the presence of different lesions from the retinal images, while Xception Net learns a comprehensive representation of the entire retinal scan. Combining the deep features from VGG16 and Xception gives a compact representation that provides a holistic view of DR images.

Conclusion
The major objective of this work is to acquire a compact and comprehensive representation of retinal images, as the feature representations extracted from retinal images significantly influence the performance of DR prediction. Initially we extract features from the deep pre-trained VGG16-fc1, VGG16-fc2 and Xception models. The VGG16 model learns the lesions and Xception learns the global representation of the images. Then the features from multiple ConvNets are blended to get the final prominent representation of colour fundus images. The final representation is obtained by pooling the representations from VGG16 and Xception features. A DNN model is trained using these blended features for the task of Diabetic Retinopathy severity level prediction. The proposed DNN model with dropout at the input avoids over-fitting and converges faster. Our experiments on the benchmark APTOS 2019 dataset show the superiority of the proposed model when compared to the existing models. Among the proposed pooling approaches, average pooling used to fuse the features extracted from the penultimate layers of multiple pre-trained ConvNets gives better performance, with minimum loss in fewer epochs, compared to the others.

Conflicts of Interest:
The authors declare no conflict of interest.