Automatic Diabetic Retinopathy Grading via Self-Knowledge Distillation

Abstract: Diabetic retinopathy (DR) is a common fundus disease that leads to irreversible blindness and plagues the working-age population. Automatic medical image diagnosis provides a non-invasive way to assist ophthalmologists in the timely screening of suspected DR cases, which helps prevent further deterioration. However, state-of-the-art deep-learning-based methods generally have a large number of model parameters, which makes large-scale clinical deployment time-consuming. Moreover, the severity of DR is associated with lesions, and it is difficult for a model to focus on these regions. In this paper, we propose a novel deep-learning technique for grading DR with only image-level supervision. Specifically, we first customize the model with the help of self-knowledge distillation to achieve a trade-off between model performance and time complexity. Secondly, CAM-Attention is used to allow the network to focus on discriminative zones, e.g., microaneurysms and soft/hard exudates. Considering that directly attaching a classifier after a Side branch would disrupt the hierarchical nature of convolutional neural networks, a Mimicking Module is employed that allows the Side branch to actively mimic the main branch structure. Extensive experiments are conducted on two benchmark datasets, with an AUC of 0.965 and an accuracy of 92.9% on the Messidor dataset and an accuracy of 67.96% on the challenging IDRiD dataset, which demonstrates the superior performance of our proposed method.


Introduction
Diabetic retinopathy (DR) is the predominant manifestation of diabetic microangiopathy, which is one of the complications of diabetes. It is reported that approximately one third of people with diabetes in the United States, Europe and Asia have some degree of DR [1]. It is also the leading cause of blindness and vision defects among working-age adults worldwide [2]. The traditional solution is to have a well-trained clinical ophthalmologist examine fundus images and subjectively assess the severity of DR. However, the scarcity of ophthalmologists hinders patients from receiving timely diagnosis and treatment, especially in underdeveloped areas, which eventually leads to irreversible vision loss. With this in mind, an automated computer-aided diagnostic (CAD) system is needed to assist ophthalmologists in the early screening of potential DR and to alleviate their labor-intensive workload.
Early research mainly focused on hand-crafted features to represent images, which requires specific domain knowledge. Adarsh et al. [3] used image processing techniques to obtain anatomical and texture features, and then fed them into a multi-class support vector machine (SVM) for classification. In [4], an ensemble-based method for the screening of DR was proposed, which extracted
The picture is kindly provided by the Messidor database [17]; no conflict of interest.
To address the above-mentioned issues, we use a large "teacher network" within the self-knowledge distillation (SKD) framework [18,19] to guide compact yet efficient "student networks" with only image-level labels, which allows the model to be pruned at inference according to the actual scenario, as shown in Figure 2. Nevertheless, naive pruning would disrupt the hierarchical structure of the CNN, so we propose the Mimicking Module (MM) to mitigate this. An L2 loss aligns the block-level outputs of the Side branches with those of the main branch, shortening the distance between them. Furthermore, the off-the-shelf CAM-Attention [20] helps the model focus on discriminative regions (e.g., lesions), significantly improving the overall performance. For evaluation, we test our method on two publicly available datasets, the Messidor dataset and the newer IDRiD challenge dataset. Experimental results show that our method outperforms state-of-the-art methods on DR screening. In summary, the contributions of this paper are as follows: (1) A novel self-knowledge distillation framework is proposed for diabetic retinopathy image grading. It can customize the pruning of the model according to the actual application scenario, which reduces latency without significantly degrading accuracy. (2) The introduction of CAM-Attention encourages the model to focus on pathological regions, and the Mimicking Module enables the model to maintain its original hierarchy while pruning. Experimental results confirm that the two modules have a positive effect on the results. (3) Quantitative and qualitative results on the Messidor and IDRiD datasets confirm the effectiveness of the methodology in this paper.
The remainder of this paper is organized as follows. The details of the SKD-based DR grading method and its components are presented in Section 2. Section 3 gives experiments on benchmark datasets. Section 4 verifies the effectiveness of each component on the Messidor dataset. Finally, in Section 5, we draw some conclusions.
Figure 2. A detailed illustration of our proposed network. In addition to the main branch and the auxiliary attention branch, the proposed framework also has three Side branches attached. Among them, the red dotted box contains multiple groups of ResBlocks; AvgPool denotes the global average pooling layer; fc denotes the fully connected layer. For binary classification, the images of stage 0 and stage 1 in the Messidor dataset are combined as referable images, and the rest are non-referable images. Backbone and the Mimicking Module will be discussed in Sections 2.3 and 3.2, respectively. Best viewed in color.
Figure 2 illustrates the overall flowchart of our DR grading method. Our goal is to design a self-knowledge distillation system that integrates scalability and flexibility and transfers knowledge from an over-parameterized model to compact models, thereby reducing response time to efficiently assist ophthalmologists in the timely diagnosis of potential DR.

CAM-Attention Module
Although CNN architectures such as ResNet [6] have demonstrated superior performance on a variety of vision tasks, Squeeze-and-Excitation [21] and CBAM [22] show that attaching channel or spatial attention components to the backbone lets the network imitate human visual behavior, i.e., focus on decisive features, and achieve notable performance gains. Recently, in [20], Fukui et al. extended a response-based visual explanation model into the Attention Branch Network (ABN) by introducing attention and perception branches on the basis of Class Activation Mapping (CAM) [23]. Inspired by ABN, we incorporate it to enhance the representation capability of the "teacher network" while focusing on the discriminative regions. Unlike ABN's approach, CAM-Attention is added after ResBlock4 instead of ResBlock3, which further reduces time consumption.
Consider an input tensor $X_i^0 \in \mathbb{R}^{3 \times H_0 \times W_0}$ and its corresponding ground truth label $y_i \in \{0, 1, \ldots, K-1\}$, where $i$ indexes the $i$-th sample and $K$ is the number of predefined classes. The input tensor first passes through $N$ convolution blocks $\{\Theta_n(\cdot)\}_{n=1}^{N}$, which form the feature extractor; the intermediate feature maps $X_i^n \in \mathbb{R}^{C_n \times H_n \times W_n}$ at block $n$ are calculated as $X_i^n = \Theta_n(X_i^{n-1})$. Here, $C_n$, $H_n$ and $W_n$ represent the number of channels, height and width of the output of the $n$-th block, respectively. Then, a channel dot-product is performed between the feature-extractor output and the attention weights to obtain the output of CAM-Attention, as formulated in Equation (1), where $\mathrm{Atten}(\cdot)$ denotes the spatial attention operation. Let $\hat{y}_i^c$ and $\hat{y}_i^m$ be the normalized outputs obtained from $\mathrm{Atten}(X_i^n)$ and from the CAM-Attention output after passing through the attention branch and the main branch, respectively (each sequentially traversing a global average pooling (GAP) layer, several fully connected (FC) layers and a SoftMax layer). When the conventional cross-entropy loss $\mathcal{L}_{CE}$ is used as the supervision signal, the loss of ABN is given by Equation (2), where $\lambda$ is used to balance the two terms. More details of ABN can be found in [20].
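As a hedged illustration of the forms Equations (1) and (2) may take, the sketch below assumes the residual attention mechanism and the two-branch cross-entropy objective of ABN [20]. The symbol $\tilde{X}_i^n$ for the attended feature map, the residual term $1 + \mathrm{Atten}(\cdot)$, and the placement of $\lambda$ on the attention-branch term are assumptions rather than details confirmed by the text.

```latex
\begin{align}
\tilde{X}_i^n &= \bigl(1 + \mathrm{Atten}(X_i^n)\bigr) \odot X_i^n, \\
\mathcal{L}_{ABN} &= \mathcal{L}_{CE}(\hat{y}_i^m, y_i) + \lambda\, \mathcal{L}_{CE}(\hat{y}_i^c, y_i).
\end{align}
```

Under this assumed form, the attention map rescales every channel of $X_i^n$ at each spatial position, and the residual term preserves the original features wherever the attention response is close to zero.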

Self-Knowledge Distillation
Top-performing deep CNN architectures suffer from heavy computational overhead, which hinders their porting to resource-constrained devices. As a model compression technique, knowledge distillation (KD) [24,25] takes the probability distribution predicted by a powerful but resource-hungry teacher model as the soft target, which is combined with one-hot labels to jointly regularize smaller models. However, conventional KD adopts a two-step optimization paradigm, i.e., first training the teacher model and then letting the learned knowledge flow progressively to the student model by having it mimic the probability distribution of the teacher model's output, which has the disadvantage of being too costly.
Recently, related work [18,19,26] has shown that the teacher and student models can come from the same CNN and dynamically transfer knowledge by adding Side classifiers behind some intermediate layers, which is called self-knowledge distillation (SKD). In [27], Lee et al. pointed out that adding auxiliary (Side) classifiers allows the intermediate layers to obtain gradient flows from both the topmost and branch losses, alleviating the vanishing gradient problem that occurs during back-propagation in deeper networks and accelerating convergence. Let $\hat{y}_j^s$ denote the output of the $j$-th Side classifier. The SKD loss is given by Equation (3), where $\beta$ is the relative weight between the two loss terms, and $\mathcal{L}_{KL}$ represents the Kullback-Leibler (KL) divergence between $\hat{y}^s$ and $\hat{y}^m$. Moreover, using a higher value of the temperature $T$ in the KL term results in a softer probability distribution over classes [24].
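The role of the temperature and of the KL term can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the direction KL(teacher || student), the $T^2$ scaling of the soft term, the way the hard and soft terms are combined, and the default temperature of 3 are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a larger temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy between the softmax of the logits and a hard label."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def side_branch_loss(side_logits, main_logits, label, beta=1.0, temperature=3.0):
    """Assumed per-branch loss: hard-label CE plus beta * T^2 * KL(teacher || student)."""
    teacher = softmax(main_logits, temperature)  # main branch acts as the teacher
    student = softmax(side_logits, temperature)  # Side branch acts as the student
    hard = cross_entropy(side_logits, label)
    soft = kl_divergence(teacher, student) * temperature ** 2
    return hard + beta * soft
```

When a Side branch already reproduces the main-branch logits, the KL term vanishes and only the hard-label term remains, so the soft term only penalizes disagreement with the teacher.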

Mimicking Module
FitNets [28] first introduced a hint loss from the teacher's hidden layers to guide the training process of the student. Nevertheless, due to the inherent hierarchical representation of CNNs [29] (shallow layers capture detail and deep layers capture semantics), blindly attaching classifiers to the middle hidden layers as described in [18] would disrupt this structure.
Our insight comes from the way students learn from a teacher, i.e., students absorb what they are taught through stage-wise learning; hence, we propose a novel Mimicking Module (MM). More specifically, we attach the same number of ResBlocks as the main branch, but thinner ones (fewer bottlenecks), behind each Side branch. The block-level constraints of the main branch (teacher) then allow the Side branch (student) to reach block-level alignment and share hierarchical information during mimicking, while reducing runtime. Below, we describe its mathematical formulation.
Let $F_l$ and $F_m$ be the intermediate layer outputs of the $l$-th branch and the main branch, respectively. Our optimization target is given by Equation (4), where $\|\cdot\|_2^2$ refers to the $L_2$ norm loss and $\eta$ is a tunable hyper-parameter. Combining Equations (2)-(4), the optimization objective of the entire network can be written as an $\arg\min$ over $W$ of the combined loss (Equation (5)), where $W$ stands for the weight matrices to be optimized.
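A minimal sketch of a block-level mimicking term and of one way to combine the losses is given below. It is an illustration under stated assumptions, not the authors' code: features are represented as flattened Python lists, the squared distances are summed over aligned blocks without normalization, the overall objective is taken to be a plain sum, and the names `mimic_loss` and `total_loss` are hypothetical.

```python
def mimic_loss(side_feats, main_feats, eta=1e-7):
    """eta times the summed squared L2 distance between aligned block outputs.

    side_feats and main_feats are lists of flattened feature vectors,
    one vector per aligned ResBlock (an assumed representation).
    """
    if len(side_feats) != len(main_feats):
        raise ValueError("Side and main branches must expose the same number of blocks")
    total = 0.0
    for f_side, f_main in zip(side_feats, main_feats):
        if len(f_side) != len(f_main):
            raise ValueError("aligned blocks must have the same feature size")
        total += sum((a - b) ** 2 for a, b in zip(f_side, f_main))
    return eta * total

def total_loss(abn_loss, skd_losses, mimic_losses):
    """Assumed overall objective: ABN loss plus per-branch SKD and mimicking terms."""
    return abn_loss + sum(skd_losses) + sum(mimic_losses)
```

Because unnormalized squared distances over full feature maps can be very large, a small weight such as the η = 1 × 10⁻⁷ reported in the experimental setup keeps this term on a scale comparable to the classification losses.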

Datasets Descriptions
Messidor. The Messidor dataset [17] contains 1200 color fundus images with DR and diabetic macular edema (DME) annotations, in which DR is graded into four classes according to severity. For a fair comparison with previous works [5,12,30-32], we treat images at levels 0 and 1 as referable and the remainder as non-referable, and use 10-fold cross-validation to verify the effectiveness of the model.
IDRiD. The IDRiD dataset [33] comes from the ISBI-2018 Challenge 2 (https://idrid.grand-challenge.org/Grading), with a total of 413 training images and 103 test images. We divide it into five classes according to the organizer's rules, and use the test set as the validation set to evaluate the experimental results.

Experimental Setup
Our experiments are conducted with the PyTorch toolkit and trained on a single NVIDIA Tesla V100 GPU. By default, we use ResNet18 [6] as the backbone and optimize the network with the Adam optimizer [34], with an initial learning rate of 0.0001. The number of ResBlocks for the three Side branches is configured as {1, 1, 2}, {1, 2} and {2}, respectively. We train for a total of 300 epochs on Messidor and 200 epochs on IDRiD, with the batch size set to 40 for both datasets. Moreover, λ, β and η are empirically set to 0.4, 1 and 1 × 10⁻⁷, respectively, to balance the gradients of the loss terms. We resize the original images to 224 × 224 and use simple data augmentation, such as horizontal and vertical flips, to increase the diversity of the data. It should be noted that, to overcome class imbalance, each training batch contains the same number of samples per class (using data re-sampling). In addition, for an analysis on a general dataset, see Appendix A.1. Our code is available at: https://github.com/JACKYLUO1991/DR-Grading.
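The class-balanced re-sampling mentioned above can be sketched in a few lines of pure Python. This is only one possible way to build such batches and is not taken from the released code: the function name `balanced_batch` is hypothetical, and drawing with replacement within each class is an assumption.

```python
import random
from collections import defaultdict

def balanced_batch(labels, batch_size, rng):
    """Return sample indices for one batch with an equal count per class.

    Indices are drawn with replacement inside each class, so rare classes
    are re-sampled as often as needed (assumed re-sampling scheme).
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    classes = sorted(by_class)
    if batch_size % len(classes) != 0:
        raise ValueError("batch_size must be divisible by the number of classes")
    per_class = batch_size // len(classes)
    batch = []
    for label in classes:
        batch.extend(rng.choices(by_class[label], k=per_class))
    rng.shuffle(batch)  # avoid class-ordered batches
    return batch

# Example: a 90/10 imbalanced binary problem and a batch of 40 samples.
labels = [0] * 90 + [1] * 10
batch = balanced_batch(labels, 40, random.Random(0))
```

A batch size of 40 divides evenly for both the binary Messidor task (20 per class) and the five-class IDRiD task (8 per class).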

Results on Messidor Dataset
To evaluate our training strategy, we follow the SKD training procedure and compare the main branch, without and with the help of CAM-Attention, against several existing methods on the Messidor dataset.
As shown in Table 1, our method outperforms state-of-the-art methods in terms of AUC (area under the receiver operating characteristic curve), Acc. (accuracy), Pre. (precision) and Rec. (recall, a.k.a. sensitivity). The quantitative results can be summarized as follows: (1) our method improves the AUC by nearly 10% compared to the method of [30], which uses laborious manual feature extraction; (2) compared with methods such as Zoom-in-net [31] that use additional data to improve performance, we still achieve outstanding results with only Messidor's annotations; (3) in contrast to CANet [12], which uses the bulky ResNet50 combined with multitask learning, our method uses the lightweight ResNet18 while increasing the AUC, accuracy and precision by 0.3%, 0.3% and 0.4%, respectively, falling below the former only in recall; (4) compared to the plain SKD, the SKD with CAM-Attention significantly improves performance, e.g., in AUC (0.959 vs. 0.966) and Acc. (91.7% vs. 92.9%), which reflects the positive effect of focusing on pathological regions. From a statistical perspective, we give a 95% confidence interval (CI) for AUC, which ranges from 0.953 to 0.979.
Table 1. Performance comparisons on the Messidor dataset. Results are given as the mean or (mean ± std) of 10-fold cross-validation. "†" indicates that the results are reproduced from [12]; the remaining values are copied from the original papers.

Method                   AUC    Acc. (%)  Pre. (%)  Rec. (%)
Pires et al. [30]        0.863  -         -         -
VNXK/LGI [32]            0.887  89.3      -         -
CKML Net/LGI [32]        0.891  89.7      -         -
CANet [12]               0.895  81.0      -         -
Comprehensive CAD [35]   0.910  -         -         -
DSF-RFcara [5]           0.916  -         -         -
Expert [35]              0.940  -         -         -
Multitask net [
Figure 3 shows the Grad-CAM [38] heatmap generated from the last convolutional layer, where the red highlights indicate regions that the model considers decisive for diagnosis.

Results on IDRID Dataset
Table 2 summarizes the comparison between our method and those proposed by other challenge participants. Since the competition covers both DR and DME, ref. [39] is, to the best of our knowledge, the only non-competition study so far that provides independent DR grading results, so we list its quantitative results as well. It should be pointed out that, unlike other solutions, our method does not use additional data for pre-training or rely on model ensembles.
As can be seen in the third column of Table 2, our method outperforms the other methods at a smaller input scale and is lower only than the solution of Lzyuncc [33], which uses an input scale of 896 × 896. Moreover, the SKD-refined student (Side branch 1) in the second line has the same classification accuracy (67.96%) as the teacher (main branch) while significantly cutting down the number of parameters, which further confirms the efficiency of our proposed method.
Table 2. Comparison with state-of-the-art results on the IDRiD dataset. Our results are bolded in blue. * indicates that the result is obtained from [33]. Consistent with the official evaluation criteria, only accuracy (unit: %) is given here.
The related confusion matrix of multi-class DR grading is illustrated in Figure 4. Looking at the confusion matrix, each class is most likely to be predicted correctly except for class 1, which is predominantly classified as class 0. Thus, class 1 is the most difficult to distinguish, and its labeling also confuses experienced ophthalmologists. This problem can potentially be mitigated by using a more powerful network. In [40], Sokolova et al. gave comprehensive performance measures for classification tasks. Among them, F1, the harmonic mean of precision and recall, is the most commonly used criterion for multi-class problems. It is formulated in terms of $P_i$ and $R_i$, the precision and recall for class $i$, where $m$ is the number of classes. Finally, an F1 of 59.98% is obtained by calling the scikit-learn library.
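For reference, the sketch below computes a macro-averaged F1, i.e., $F1 = \frac{1}{m}\sum_{i=1}^{m} \frac{2 P_i R_i}{P_i + R_i}$, in pure Python. That the reported score uses this macro average (the behaviour of scikit-learn's `f1_score` with `average="macro"`) is an assumption, since the averaging mode is not stated in the text; the function name `macro_f1` is hypothetical.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores.

    A class whose precision and recall are both zero contributes an F1 of 0.
    """
    scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / num_classes

# Toy example with three classes (not data from the paper).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, 3)
```

Because every class contributes equally, a hard minority class such as class 1 in the confusion matrix lowers the macro F1 more than it lowers plain accuracy.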

Ablation Studies
Here, fold 1 of the Messidor dataset is taken as an example to construct ablation experiments on several factors and measure their contributions to our results. We regard the main branch without CAM-Attention (CA) as the baseline, and then add the CA module and the MM module in order, as well as training with SKD.
From Table 3 we can conclude that (i) the addition of the CA module encourages the model to focus on pathological features, and the accuracy is improved by 1.66% relative to the baseline, which is consistent with the conclusion in Section 3.3; (ii) all four classifiers outperform the baseline in terms of accuracy by virtue of the MM's hierarchical mimicry mechanism and co-optimization, which leverages weight sharing to improve performance on the primary task; and (iii) with the help of SKD, the students progressively approximate the distribution of the teacher. In particular, the performance of Side branch 1 (94.17% with SKD) is equivalent to that of the teacher without SKD, which verifies the effectiveness of the SKD training strategy.

Efficiency of the Network
One way to estimate the speed of a model is to count how many computations it performs, which also eliminates performance differences caused by a specific GPU model or other hardware resources. This count is typically expressed in FLOPs (floating-point operations); the fewer FLOPs a model requires, the less time inference generally takes.
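As a rough illustration of how such a count is obtained, the sketch below computes the operations of a single bias-free 2D convolution. The helper names are hypothetical, and counting one multiply-accumulate as two FLOPs is an assumed convention (some tools report multiply-accumulates directly as FLOPs).

```python
def conv2d_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def conv2d_flops(in_ch, out_ch, kernel, in_h, in_w, stride=1, padding=0):
    """FLOPs of a bias-free 2D convolution, counting 1 multiply-accumulate as 2 FLOPs."""
    out_h = conv2d_output_size(in_h, kernel, stride, padding)
    out_w = conv2d_output_size(in_w, kernel, stride, padding)
    macs = out_h * out_w * out_ch * in_ch * kernel * kernel
    return 2 * macs

# The standard ResNet stem: 3 -> 64 channels, 7 x 7 kernel, stride 2, padding 3,
# applied to a 224 x 224 input.
stem_flops = conv2d_flops(3, 64, 7, 224, 224, stride=2, padding=3)
```

Summing such per-layer counts over the blocks that a given branch actually executes gives a hardware-independent estimate of its relative cost.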
As shown in Figure 5, the inference time can be adapted dynamically to the actual scenario, which highlights the advantage of SKD. In particular, compared to the main branch, Side branch 1 improves operating efficiency by 1.4 times. It is reasonable to believe that this gap becomes more prominent when the model is deeper and the teacher-student block-level compression ratio is increased. Moreover, this technique can be applied to portable devices to improve their execution efficiency.

Discussion on Free Parameters Selection
Currently, we assign the free parameters in Equations (2)-(4) empirically, with the core idea of bringing the loss terms to the same order of magnitude. Specifically, different loss functions in the same task can have very different scales, so it is necessary to unify these scales with weights. In general, the gradient magnitudes of different loss functions differ during convergence, as does their sensitivity to the learning rate. Adjusting the losses to the same order of magnitude prevents losses with small gradients from being dominated by those with large gradients, so that the learned features generalize better. However, manual tuning requires repeated trial and error to find good values, the process is usually cumbersome, and the results are often sub-optimal.
Two lines of work are worthy of further investigation. The first comes from [41], whose basic idea is to estimate the uncertainty of each loss term. Specifically, each loss is divided by its uncertainty, which is essentially equivalent to automatically reducing the weight of the corresponding loss. The other comes from an open-source project (https://github.com/ultralytics/yolov5) that uses genetic algorithms to search for hyper-parameters, which is more efficient than grid search. Since our work focuses on the proposed SKD method, we leave the search for free parameters as future work.
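The idea of dividing each loss by an uncertainty can be sketched as follows. This uses a commonly seen simplified form in which each task has a learnable log-variance $s_k$ and contributes $e^{-s_k} L_k + s_k$; whether this matches the exact formulation in [41] is an assumption, and the function name is hypothetical.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Sum of exp(-s_k) * L_k + s_k over loss terms.

    s_k = log(sigma_k^2) would be a learnable parameter in practice;
    here it is passed in as a plain number for illustration.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need one log-variance per loss term")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

# With all log-variances at zero, the result is the plain sum of the losses.
plain = uncertainty_weighted_loss([2.0, 3.0], [0.0, 0.0])
```

For a fixed loss value $L_k$, the expression $e^{-s_k} L_k + s_k$ is minimized at $s_k = \log L_k$, so a larger loss is automatically given a smaller weight $1 / L_k$, while the additive $s_k$ term prevents the weights from collapsing to zero.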

Conclusions
In this paper, for diabetic retinopathy grading, we first introduce CAM-Attention, which allows the model to focus on discriminative regions and yields a powerful teacher network. Then, a training strategy called self-knowledge distillation (SKD) is presented, which enables dynamic adjustment of inference time while improving performance. Finally, considering that attaching classifiers directly after the shared layers would disrupt the hierarchical consistency between the teacher and the students, we propose a Mimicking Module. Experimental results demonstrate that the proposed SKD can boost the performance of the students significantly.
This work can be further applied to resource-constrained devices, e.g., mobile phones, to reduce model inference latency without significant performance degradation. In addition, for automatic medical image screening, our work can relieve the fatigue of ophthalmologists while quickly producing diagnosis results. On the other hand, the limitation of this research lies in the choice of hyper-parameters, which are currently set only by manual tuning. Our next work will introduce the genetic algorithm mentioned in [8] to select hyper-parameters. In future work, we will also focus on semi-supervised and weakly supervised learning to reduce the system's strong dependence on labeled data, while using graph neural networks (GNNs) for modeling.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
DR  Diabetic retinopathy

Appendix A.1
To further confirm the generalizability of our proposed method, we construct experiments on the CIFAR-100 dataset [42]. CIFAR-100 contains 50,000 training images and 10,000 test images across a total of 100 classes, with an image size of 32 × 32 pixels. For the fairness of the experiment, data preprocessing, training parameter selection and hyper-parameter configuration are carried out according to [18]. DR classification transfer is done based on ResNet18, which is regarded as the baseline. The value to the left of the slash comes from [18], and the value to the right is the result we reproduced.
From Table A1, we can see the advantages of the method in this paper, especially on Side branch 1, which achieves a balance between speed and accuracy. The method in [18] directly attaches a classifier after a certain Side branch, which on the one hand destroys the consistency of hierarchical features in CNNs, and on the other hand causes significant performance degradation by cutting off most of the high-level feature layers. In contrast, our method brings considerable gains (in particular, an increase of 8.61% in accuracy on Side branch 1), reflecting the powerful feature mapping ability of the Mimicking Module. In addition, multiple branches improve in accuracy, indicating that adding CAM-Attention improves the performance of the teacher (main branch), which in turn assists the students (Side branches) in learning. These experiments confirm the importance of the knowledge interaction process in improving the efficiency of the sub-branches and the baseline performance of a single CNN model.