Lightweight Frequency Recalibration Network for Diabetic Retinopathy Multi-Lesion Segmentation

Abstract: Automated segmentation of diabetic retinopathy (DR) lesions is crucial for assessing DR severity and supporting diagnosis. Most previous segmentation methods overlook the detrimental impact of texture information bias, resulting in suboptimal segmentation results, and the role of lesion shape is not thoroughly considered. In this paper, we propose a lightweight frequency recalibration network (LFRC-Net) for simultaneous multi-lesion DR segmentation, which integrates a frequency recalibration module into the bottleneck layer of the encoder to analyze texture information and shape features together. The module utilizes a Gaussian pyramid to generate features at different scales, constructs a Laplacian pyramid using difference of Gaussians filtering, and then analyzes object features in the different frequency bands of the Laplacian pyramid. The high-frequency component handles texture information, while the low-frequency component focuses on learning the shape features of DR lesions. By adaptively recalibrating these frequency representations, our method can differentiate the objects of interest. In the decoder, we introduce a residual attention module (RAM) to enhance lesion feature extraction and efficiently suppress irrelevant information. We evaluate the proposed model's segmentation performance on two public datasets, IDRiD and DDR, and a private dataset of ultra-wide-field fundus images. Extensive comparative experiments and ablation studies are conducted across the datasets. With minimal model parameters, our approach achieves an mAP_PR of 60.51%, 34.83%, and 14.35% for the segmentation of EX, HE, and MA on the DDR dataset and also obtains excellent results for EX and SE on the IDRiD dataset, which validates the effectiveness of our network.


Introduction
Diabetic retinopathy (DR), now a prevalent medical issue among diabetics globally, stands as a primary cause of blindness in the working-age population [1]. It is estimated that 93 million individuals worldwide are affected by DR [2,3]. The condition is characterized by lesions such as microaneurysms (MAs), hemorrhages (HEs), soft exudates (SEs), and hard exudates (EXs) [4,5], as illustrated in Figure 1, which are fundamental in ophthalmological diagnoses.
At present, there is no absolute cure for this condition. The most effective approach entails early detection and intervention to control its advancement [6]. In the clinic, ophthalmologists must manually examine lesions in fundus images to screen for DR, which is not only time-intensive but also subject to the subjective judgment of doctors, posing a challenge to the reliability of detection. Therefore, developing an automated method for lesion segmentation is pivotal in the diagnosis of DR.
There are considerable variations in the shape, size, and appearance of each lesion type among different individuals, and different lesions may also manifest similar features. MAs and HEs typically present lower intensities in images, while EXs and SEs tend to be brighter. The intra-class differences and inter-class similarities make the segmentation of DR an exceptionally complex endeavor. The diminished contrast and clarity between a lesion and normal areas further challenge segmentation and detection approaches.
A variety of image segmentation methods have emerged, and among them, those based on deep learning outperform conventional machine learning techniques. Convolutional neural networks (CNNs) have gained widespread adoption due to their proficiency in extracting intricate hierarchies of features from data. U-Net and related models [7-10] are frequently utilized in medical image segmentation. Most of them augment the architecture of the original U-Net by embedding attention mechanisms and diverse nonlinear functions within the convolutional layers. When applied to retinal fundus image segmentation, they are often confronted with distractions arising from inadequate feature extraction. To address this issue, some approaches extract distinct features in different domains. Li et al. [11] used the high-frequency domain for retinal vessel segmentation, which addressed the oversensitivity of previous networks to low-frequency noise in fundus images. HGC-Net [12] enhanced vessel segmentation in fundus images by extracting high-frequency components to highlight vascular structures. Li et al. [13] proposed a network for annotation-free restoration of cataractous fundus images, which utilized high-frequency components extracted from fundus images in place of segmentation maps to preserve the retinal structure. However, these methods overlooked the influence of texture information bias in the high-frequency domain on segmentation performance. To address these challenges, and inspired by [14], we introduce a model named LFRC-Net to simultaneously segment four types of DR lesions. An improved frequency recalibration module (FRCM) is incorporated at the top of the encoder to capture a wider range of texture and shape characteristics of lesions. A Laplacian pyramid is integrated into the FRCM, where the low-frequency domain is used to learn shape information and the high-frequency components to learn texture features, which diminishes noise in the final segmentation results. Subsequently, a weighted combination aggregates features across all levels of the Laplacian pyramid, with weights determined by each level's significance to the segmentation result; this focuses on feature maps with rich information and suppresses noisy attributes through the global embedding of the channels. Finally, a residual attention module combining channel attention and spatial attention is incorporated into the decoder, where the channel attention identifies features of interest and the spatial attention pinpoints their locations. In this way, the interference of irrelevant features is further mitigated, while the pertinent lesions are enhanced.

Methods
The proposed LFRC-Net is introduced in this section and illustrated in Figure 2; an encoder-decoder architecture is taken as the backbone. A frequency recalibration module (FRCM) is placed at the bottleneck layer of the encoder, and residual attention modules are integrated into the decoder. The process begins by randomly choosing centers in regions of interest within the original image to form 256 × 256 patches, which are then input into the encoder. The encoded feature maps are subsequently refined through the residual attention module (RAM) and the FRCM to adeptly filter out extraneous features.
The refined feature maps are directed to the decoder, which involves three stages, each with a residual attention module. Up-sampling of the feature layers is performed by bilinear interpolation, while the skip connections are accomplished through 1 × 1 convolutions, batch normalization (BN) layers, and ReLU activation, where the 1 × 1 convolutions effectively reduce the channel size of the feature maps from the encoder. The number of convolutions is denoted by K, which can be adjusted to obtain an appropriate parameter size for the model. The segmentation of the four types of DR lesions is obtained through a 1 × 1 convolution layer followed by softmax activation.
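The patch-extraction step described above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the sampling strategy (uniform sampling of lesion-mask pixels as patch centers) and the function names are assumptions.

```python
import numpy as np

def random_patches(image, mask, num_patches=8, size=256, rng=None):
    """Randomly crop square patches centered inside regions of interest.

    Illustrative sketch: centers are drawn uniformly from nonzero mask
    pixels and clipped so every patch lies fully inside the image.
    """
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    half = size // 2
    ys, xs = np.nonzero(mask)  # candidate centers: lesion pixels
    patches = []
    for _ in range(num_patches):
        k = rng.integers(len(ys))
        cy = int(np.clip(ys[k], half, h - half))
        cx = int(np.clip(xs[k], half, w - half))
        patches.append(image[cy - half:cy + half, cx - half:cx + half])
    return np.stack(patches)
```

Clipping the centers keeps every 256 × 256 patch inside the image bounds, so no padding is needed at sampling time.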

Encoder Module
MobileNet V2 [15], pretrained on the ImageNet dataset [16], is taken as the backbone of the encoder of our model; it consists of two modules: an inverted residual module with a stride of 1 and a down-sampling module with a stride of 2. The inverted residual structure is the most notable innovation in MobileNet V2 compared with MobileNet V1: it increases and then decreases the dimensionality through 1 × 1 convolutions to preserve richer channel information while reducing the number of learnable parameters. The inverted residual structure removes the nonlinear function during the dimension reduction phase, retaining only the 1 × 1 convolution, which not only preserves feature diversity but also enhances the expressive capability of the model. Unlike standard convolution, depth-wise convolution applies a filter to each input channel separately and then combines the results, which significantly reduces computational cost and the number of parameters. MobileNet V2 is illustrated in Figure 3, where c, n, s, and Dwise represent the channel number, module number, stride, and depth-wise convolution, respectively, while t denotes the scaling factor of the channel number within the inverted residual module.
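The parameter savings of depth-wise separable convolution over standard convolution can be checked with a quick count; this is a generic calculation, not code from the paper:

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """A k x k depth-wise filter per input channel, then a 1 x 1 point-wise mix."""
    return k * k * c_in + c_in * c_out

# Example: a 3 x 3 layer mapping 32 channels to 64 channels.
standard = conv_params(32, 64, 3)                  # 18432 weights
separable = depthwise_separable_params(32, 64, 3)  # 288 + 2048 = 2336 weights
```

For this layer the separable variant uses roughly 7.9× fewer weights, which is the main source of MobileNet V2's efficiency.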

Frequency Recalibration Module (FRCM)
Retinal fundus images are more sensitive to variations in lighting than images of natural scenes, which weakens the morphological features of lesions under different lighting conditions. Traditional convolutional neural networks (CNNs) tend to extract features based on texture rather than shape [17,18], which limits their ability to utilize useful low-frequency shape information [14]. The frequency recalibration module (FRCM) is proposed to extract both texture and shape features, as illustrated in Figure 4.
There are three components: a Gaussian pyramid, a Laplacian pyramid, and a frequency attention module. The input feature map is transformed into a Laplacian pyramid using the difference of Gaussians (DoG): two-dimensional Gaussian kernels with different variances are generated from the Gaussian function to build a Gaussian pyramid, and the Laplacian pyramid is formed by subtracting adjacent levels of the Gaussian pyramid. The Gaussian and Laplacian pyramids are computed as in Equations (1) and (2):

$$G_l = g_{\sigma_l} * x, \qquad g_{\sigma_l}(i,j) = \frac{1}{2\pi\sigma_l^{2}}\exp\!\left(-\frac{i^{2}+j^{2}}{2\sigma_l^{2}}\right), \tag{1}$$

$$LP_l = G_l - G_{l+1}, \tag{2}$$

where H, W, and C denote the height, width, and number of channels of the feature map, respectively; x ∈ R^{H×W×C} is the input feature map of the FRCM; l indexes the levels of the Gaussian pyramid, ranging from G_0 to G_4 (with G_0 = x); σ_l is the variance of the Gaussian kernel at the l-th layer; i and j are the two-dimensional Gaussian kernel coordinates; * is the convolution operation; and LP_l is the l-th layer of the Laplacian pyramid (the top level retains the remaining low-frequency component).

A frequency attention module, inspired by ECA-Net [19], is then applied to the frequency layers of the Laplacian pyramid. First, each layer of the Laplacian pyramid undergoes global max pooling to acquire global spatial information and generate weights for all input channels. Then, unlike [14], a one-dimensional convolution with shared weights, rather than a fully connected layer, is used to capture the channel dependencies of each layer, which reduces the number of model parameters and enhances inter-channel information interaction. Global max pooling and the inter-channel dependencies of the feature maps in each layer are computed as in Equations (3) and (4); the recalibrated feature map of each channel is obtained by multiplying the learned weight with the input channel features:

$$GMP_l^{f} = \max_{1\le i\le H,\;1\le j\le W} LP_l^{f}(i,j), \tag{3}$$

$$w_l^{f} = \sigma\big(\mathrm{1DConv}(GMP_l)\big)^{f}, \qquad \widehat{LP}_l^{f} = w_l^{f}\,LP_l^{f}, \tag{4}$$

where H and W represent the height and width of the feature map, respectively; LP_l^f is the f-th channel of the feature map in the l-th layer; GMP_l^f is the output of LP_l^f after global max pooling; w_l^f is the learned weight for the f-th channel of the l-th layer; 1DConv and σ denote the one-dimensional convolution and sigmoid function, respectively; and \widehat{LP}_l^f is the recalibrated output obtained from LP_l^f. After recalibrating the feature maps of each layer, the frequency attention module fuses the features from all layers through a 3D convolution to produce the final output x′ of the FRCM:

$$x' = \mathrm{Conv3D}\big(\big[\widehat{LP}_0, \widehat{LP}_1, \ldots, \widehat{LP}_4\big]\big).$$
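The DoG-based pyramid construction can be sketched in NumPy as below. The kernel radius, the variance schedule, and the choice to keep the residual low-frequency level are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def gaussian_kernel(sigma, radius=4):
    """Normalized 2-D Gaussian kernel of size (2*radius+1)^2."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def conv2d_same(x, k):
    """Naive 'same' convolution of a single-channel map with edge padding."""
    r = k.shape[0] // 2
    xp = np.pad(x, r, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + 2 * r + 1, j:j + 2 * r + 1] * k)
    return out

def laplacian_pyramid(x, sigmas=(1, 2, 4, 8)):
    """Build a DoG Laplacian pyramid; the last level keeps the low frequencies."""
    gauss = [x.astype(float)]
    for s in sigmas:
        gauss.append(conv2d_same(x, gaussian_kernel(s)))
    lap = [gauss[l] - gauss[l + 1] for l in range(len(sigmas))]
    lap.append(gauss[-1])  # residual low-frequency level
    return lap
```

Because the band-pass levels telescope, summing all levels recovers the input exactly, so the decomposition loses no information; the FRCM reweights these levels before recombining them.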

Residual Attention Module (RAM)
This paper introduces residual attention modules into the decoder to improve the segmentation performance of our LFRC-Net, as illustrated in Figure 5; each combines a residual block with a hybrid channel-spatial attention block. The residual block is incorporated to mitigate the issue of gradient vanishing, the channel attention block efficiently identifies and utilizes pivotal feature channels, and the spatial attention block concentrates on prominent local area features.
The features x ∈ R^{H×W×C} are input into a residual block comprising a depthwise separable convolution layer, a DropBlock layer, a BN layer, and a ReLU activation function to obtain the output CL(x). Notably, the kernel size of both depthwise separable convolution layers in the decoder is set to 3. Then CL(x) proceeds through a hybrid channel-spatial attention module with a channel attention mechanism and a spatial attention mechanism, as depicted in Figure 6.
In the channel attention mechanism, CL(x) is fed into average pooling and max pooling to capture the global information of each channel. Then, the channel attention weight vector is obtained by a shared-weight one-dimensional convolution and a sigmoid function and multiplied with CL(x) to obtain the output of the channel attention block, CA(x).
In the spatial attention mechanism, the spatial location features of CA(x) are obtained by summing the results of channel-wise average pooling and max pooling. The spatial attention weight vector is acquired through a 7 × 7 convolution and a sigmoid function and multiplied with CA(x) to attain the final output of RAM, x_1 ∈ R^{H×W×C}. The residual attention module is computed by Equations (5)-(7):

$$CL(x) = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{DSConv}(x))\big) + x, \tag{5}$$

$$CA(x) = \sigma\big(\mathrm{1DConv}(\mathrm{AvgPool}(CL(x))) + \mathrm{1DConv}(\mathrm{MaxPool}(CL(x)))\big) \otimes CL(x), \tag{6}$$

$$x_1 = \sigma\big(\mathrm{Conv}_{7\times 7}(\mathrm{AvgPool}_c(CA(x)) + \mathrm{MaxPool}_c(CA(x)))\big) \otimes CA(x), \tag{7}$$

where DSConv denotes the depthwise separable convolution branch of the residual block; AvgPool and MaxPool denote global average and max pooling over spatial positions; AvgPool_c and MaxPool_c denote pooling along the channel axis; and ⊗ denotes element-wise multiplication with broadcasting.
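The channel and spatial attention steps of RAM can be sketched in NumPy as follows. The pooling and combination order follows the prose description above, and the weight shapes and the point at which the pooled vectors are summed are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w):
    """x: (H, W, C) feature map; w: shared 1-D conv weights (odd length)."""
    gap = x.mean(axis=(0, 1))           # global average pooling -> (C,)
    gmp = x.max(axis=(0, 1))            # global max pooling -> (C,)
    conv = lambda v: np.convolve(v, w, mode="same")
    a = sigmoid(conv(gap) + conv(gmp))  # per-channel attention weights
    return x * a                        # broadcast over H and W

def spatial_attention(x, w7):
    """x: (H, W, C); w7: 7 x 7 spatial kernel; pooling is along channels."""
    pooled = x.mean(axis=2) + x.max(axis=2)  # sum of avg- and max-pooled maps
    r = w7.shape[0] // 2
    xp = np.pad(pooled, r, mode="edge")
    att = np.zeros_like(pooled)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            att[i, j] = np.sum(xp[i:i + 2 * r + 1, j:j + 2 * r + 1] * w7)
    return x * sigmoid(att)[..., None]   # broadcast over channels
```

Since the sigmoid weights lie in (0, 1), both blocks can only attenuate feature responses, which is how irrelevant channels and locations are suppressed.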

Loss Function
The cross-entropy loss is used to measure the pixel-level segmentation error for the four DR lesions, as shown in Equation (8):

$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} p_{ij}\log(q_{ij}), \tag{8}$$

where N represents the number of samples; M the number of classes; p_ij the ground-truth label for the j-th class of the i-th sample, where 1 indicates membership in the class and 0 otherwise; and q_ij the predicted probability for the j-th class of the i-th sample in the model's output.
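A direct NumPy implementation of this pixel-wise loss (here over N flattened samples and M classes) might look like the following; the clipping constant is an implementation detail for numerical stability, not part of the paper:

```python
import numpy as np

def multiclass_cross_entropy(p, q, eps=1e-12):
    """Mean cross-entropy loss.

    p: (N, M) one-hot ground-truth labels.
    q: (N, M) predicted class probabilities (rows sum to 1).
    """
    q = np.clip(q, eps, 1.0)                      # avoid log(0)
    return -np.mean(np.sum(p * np.log(q), axis=1))
```

For a perfect prediction (q equal to p) the loss is 0; it grows without bound as probability mass moves away from the true class.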

Results and Discussion
We validate the proposed architecture on two public datasets, IDRiD and DDR, as well as one local dataset of ultra-wide-field fundus images. All experiments are performed on an NVIDIA GeForce RTX 2080 Ti GPU (NVIDIA, Santa Clara, CA, USA) with 12 GB of video memory.

Datasets
The IDRiD [20] and DDR [21] datasets are described at the end of this section. The ultra-wide-field fundus images dataset is a local dataset collected and organized by Xinhua Hospital, affiliated with Shanghai Jiao Tong University School of Medicine, and captured under a 200-degree field of view (FOV); it contains 261 fundus images of size 3900 × 3072 with pixel-level annotations of EX and HE. In the experiments, all images are resized to 1300 × 1024. A total of 165 images are used for training and 96 for testing. During training and testing, these images are randomly cropped into patches of size 256 × 256, and 20% of the patches from the training set are randomly selected to serve as the validation set in each iteration.

Training Parameter and Evaluation
The experiments are conducted in Keras and optimized using the Adam (adaptive moment estimation) optimizer. Considering the stability and speed of the training process, the initial learning rate is set to 0.001. Based on the sizes and complexities of the IDRiD, DDR, and ultra-wide-field fundus image datasets and the need to reduce the risk of overfitting, the training epochs for the three datasets are 100, 150, and 100, respectively. Considering the scales of the datasets, as well as the convergence speed of the model and hardware memory usage, the batch sizes for the three datasets are 8, 8, and 16, respectively. During training, the model with the lowest validation loss is retained for testing.
In this paper, the quantitative evaluation metrics include accuracy, recall, specificity, precision, and F1, and ROC (receiver operating characteristic) and PR (precision-recall) curves are plotted. For fundus image segmentation, we primarily focus on the areas under the ROC and PR curves. The ROC curve has the false-positive rate (FPR) on the x-axis and the true-positive rate (TPR) on the y-axis; the larger the area under the curve (AUC_ROC), the better the model performs across different thresholds. The PR curve has recall on the horizontal axis and precision on the vertical axis; the larger the area under the curve (mAP), the better the model performs in predicting the positive class.
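Both curve areas can be computed from sorted prediction scores. The sketch below builds ROC points and integrates them with the trapezoidal rule; it is a generic illustration, not the paper's evaluation code:

```python
import numpy as np

def roc_points(scores, labels):
    """Return (FPR, TPR) points swept over all score thresholds."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores)          # sort by descending score
    labels = labels[order]
    tps = np.cumsum(labels)              # true positives at each cut-off
    fps = np.cumsum(1 - labels)          # false positives at each cut-off
    tpr = tps / labels.sum()
    fpr = fps / (len(labels) - labels.sum())
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

def area_under_curve(xs, ys):
    """Trapezoidal integration of a curve given as point sequences."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    return float(np.sum(np.diff(xs) * (ys[1:] + ys[:-1]) / 2.0))
```

A perfect ranking of positives above negatives gives an area of 1.0, while chance-level ranking gives about 0.5.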

Experimental Results
Table 1 gives the quantitative analysis results on IDRiD. The ROC and PR curves on IDRiD given by DRNet, FRCU-Net, and our network are illustrated in Figure 7. The best results are indicated in bold. The results of [22-31] are from the original papers, while those of DRNet [32] and FRCU-Net [14] are given by our own implementations on this dataset. Our method achieves the highest AUC_ROC for HE and SE among all methods. The AUC_ROC for MA is 0.51% lower than that of DRNet [32]. Additionally, the mAPs of both EX and SE of our method rank first among all methods; the mAP for EX is 9.26% higher than [31] and 0.46% higher than [29]. Our method also achieves the highest accuracy and specificity for the segmentation of all four types of DR lesions. Compared to the second-ranked [29], the mAPs of EX and SE increase by 0.46% and 0.41%, respectively. At the same time, the F1 score of EX ranks second among all methods, only 0.06% lower than [22]. Table 2 presents the segmentation results on DDR, contrasting the segmentation of the four DR lesions. The experimental results of HED [33], DeepLab v3+ [34], U-Net [35], and L-seg [36] come from RTNet [37], while those of [28,29,37-39] are taken from their own publications. In the comparative results, most methods only report the mAP and AUC_ROC of the four types of DR lesions; therefore, we mainly discuss these two evaluation metrics for the different methods on this dataset. For the segmentation of EX and MA, the mAP (60.51% and 14.35%) and AUC_ROC (97.74% and 95.64%) of our model are the best among all models; the mAP of HE segmentation is 0.84% higher than the second-ranked U-Net [35]. Regarding the segmentation of SE, the AUC_ROC is 8.18% higher than that of RTNet [37]. The ROC and PR curves on DDR of our network are shown in Figure 8.
Table 3 shows the quantitative analysis results on the ultra-wide-field fundus images. For the segmentation of EX, the mAP, F1, recall, and other metrics of LFRC-Net are the best among all the compared methods: the mAP is 57.90%, F1 is 52.77%, and recall is 41.95%. Regarding HE segmentation, our approach achieves the best AUC_ROC, mAP, and recall and the second-best F1 (2.27% lower than SA_Unet [40]). The ROC and PR curves of the different models on this dataset are shown in Figure 9. We employ the lightweight MobileNet V2 as the encoder of our network to reduce training time and memory usage. Table 4 compares the parameter counts of different lightweight models on IDRiD. As the comparison shows, our model has only 0.95 M parameters, the smallest number among all the methods evaluated.

Ablation
Ablation experiments are conducted to validate the effectiveness of the proposed RAM and FRCM on the IDRiD, DDR, and ultra-wide-field fundus image datasets; the results are presented in Tables 5-7, respectively. In Table 5 on IDRiD, using only the baseline network, the mAP, F1, accuracy, and sensitivity for the four DR lesions are the worst. After introducing RAM, the mAP and F1 of MA increase by 15.02% and 14.92%, respectively, indicating that RAM focuses more on the areas of interest and better locates lesion regions. However, compared to solely introducing RAM, the F1 score of MA decreases by 2.45% after the introduction of the FRCM. This decline is partly due to the uneven distribution of MA lesions and their very small size, which pose challenges for accurately locating MAs and segmenting them based on their shapes. In contrast to solely introducing the FRCM, introducing RAM and the FRCM together increases the F1 score of MA by 1.22%, which indicates that RAM can assist in accurately locating small lesions, thereby enhancing segmentation performance. Although the simultaneous introduction of RAM and the FRCM leads to a 2.48% decrease in the F1 score of SE, the mAP increases by 12.16%; this suggests a significant improvement in prediction accuracy, making the performance loss acceptable. Furthermore, compared to the baseline network, the mAPs of EX, HE, MA, and SE increase by 5.17%, 4.93%, 10.57%, and 21.45%, respectively, and the F1 scores also improve by 6.63%, 7.86%, 13.69%, and 17.27% after introducing the two modules, which demonstrates that the low-frequency domain is more effective in segmenting lesion areas and reducing the impact of texture information bias. Overall, our network achieves the highest AUC_ROC and mAP for EX and SE, while the mAP and F1 of HE also reach optimal levels, indicating the beneficial effects of RAM and the FRCM on lesion segmentation.
In Table 6 on DDR, the baseline network exhibits relatively low performance, with F1 scores for the four DR lesions of 43.97%, 21.57%, 7.98%, and 25.71%. After the introduction of RAM, the F1 scores of EX, MA, and SE increase by 2.34%, 0.32%, and 0.68%, respectively. After solely introducing the FRCM, the F1 score of HE decreases by 3.36%. This is because of the lower image quality of the DDR dataset and the influence of lighting conditions, which lead to an incomplete presentation of the shape of HE lesions in some images and thereby affect the segmentation of HE. However, with both RAM and the FRCM, the AUC_ROC, mAP, and accuracy of the four types of DR lesions are all optimal, and the F1 scores of EX, MA, and SE are also the highest. These analyses show that both modules contribute to mitigating interference from background information and enhance the segmentation ability of our network.
As shown in Table 7, after introducing RAM, the AUC_ROC, mAP, and F1 scores for both EX and HE improve. After further introducing the FRCM, and compared to solely introducing RAM, the F1 score of EX decreases by 0.81% while the mAP increases by 5.47%; such a performance loss is deemed acceptable. Overall, with the introduction of both RAM and the FRCM, the mAP, accuracy, and recall for EX are optimal, and for HE, the AUC_ROC, mAP, F1, accuracy, and recall are likewise the highest. This result reinforces the conclusion above and demonstrates the effectiveness of our network in segmenting multiple DR lesions on different datasets.

Visualization
The visualization results of DRNet, FRCU-Net, and our network for the four types of DR lesions on the IDRiD dataset are shown in Figure 10. Red boxes denote the labeled lesion areas, while blue boxes indicate the segmentation results. The first and second rows display the segmentation of EX, the third and fourth rows HE, the fifth and sixth rows SE, and the seventh and eighth rows MA. In the segmentation of EX, our network clearly segments lesion areas where DRNet [32] and FRCU-Net [14] fail. For HE, the third and fourth rows show that our network achieves more precise segmentation than the other networks, reducing the possibility of mis-segmentation. For SE, our algorithm is also more robust and obtains better segmentation results. This demonstrates that our network improves segmentation performance by simultaneously utilizing the shape and texture information of DR lesions, reducing interference from irrelevant features. As for MA, there is little difference between the three algorithms, because MAs occupy only a few pixels, which challenges all of the segmentation methods involved. For the visualization on the ultra-wide-field fundus images in Figure 11, the first and second rows show the segmentation results of EX and the third and fourth rows those of HE.
Green boxes represent lesion areas; for clarity, red boxes provide magnified views of the green ones, while blue boxes denote the segmentation results of the DR lesions. For EX, it is evident that our model accurately identifies the lesion areas. The third row shows that our network is also superior to the other models even when the lesion areas are relatively scattered, as with HE. Figures 12-14 show the segmentation results of our network on the IDRiD, DDR, and ultra-wide-field fundus image datasets, respectively. The first row shows the original images, the second row the ground truth, and the third row the segmentation achieved by our model. In Figures 12 and 13, our network simultaneously segments the four DR lesions; blue, red, green, and yellow represent the segmentation results of EX, HE, SE, and MA, respectively. Due to the inconsistent image quality and significant impact of lighting in the DDR dataset, its segmentation performance is not as good as that on the IDRiD dataset. Figure 14 illustrates our network's segmentation of ultra-wide-field fundus images, where EX and HE are segmented simultaneously in blue and red, respectively. The low contrast of these fundus images and the sparsity of lesions make segmentation challenging. Furthermore, there is considerable interference in the images, such as eyelids and eyelashes, while the region of interest is only a small part. Despite these factors, our approach achieves higher-precision segmentation and a more comprehensive delineation of regions by utilizing attention mechanisms.

Potential Clinical Implications
DR is a leading cause of adult blindness. This research contributes to the early detection and intervention of DR to prevent the progression of the disease. By utilizing LFRC-Net for automatic segmentation and analysis of retinal images, our network can locate the lesion areas, which improves diagnostic accuracy and reduces the workload of ophthalmologists. Furthermore, multiple experiments in this paper validate the effectiveness of LFRC-Net in DR lesion segmentation. Thus, LFRC-Net can be considered for integration into existing diagnostic workflows, which primarily include data acquisition and preprocessing, lesion detection and segmentation, report generation, and clinical decision support.

Conclusions
In this paper, we propose a novel architecture called LFRC-Net that combines a frequency recalibration module with a residual attention module to simultaneously segment four types of DR lesions. The outstanding performance of our network is attributed to the frequency recalibration module, which reduces texture information bias and effectively leverages the shape information of lesions for precise DR lesion segmentation. Additionally, the residual attention module concentrates on the lesion regions, which further improves segmentation performance. The experiments on three datasets demonstrate the effectiveness of our network in segmenting multiple DR lesions.
Despite the remarkable achievements of LFRC-Net on regular field-of-view retinal images, it still faces significant challenges in multi-lesion segmentation on ultra-wide-field fundus images, which are characterized by poor quality, low contrast, sparse lesion distribution, and small lesion shapes. This will be a focal point of our future work.

Figure 1. Four categories of DR lesions.

Figure 3. Structure of MobileNet V2 and its components.

Figure 7. The ROC and PR curves on IDRiD of DRNet and our network.

Figure 8. The ROC and PR curves on DDR of our network.

Figure 9. The ROC and PR curves on ultra-wide-field fundus images of different models.

Figure 11. Visualized results of different models on ultra-wide-field fundus images.

Figure 13. Visualized results of our LFRC-Net on DDR.

Figure 14. Visualized results of our LFRC-Net on ultra-wide-field fundus images.
The IDRiD dataset [20,21] was launched at the 2018 Biomedical Retinal Image Challenge for lesion segmentation and DR grading. For the segmentation of EX, HE, MA, and SE, there are 81 color fundus images of size 4288 × 2848, with 54 used as the training set and 27 as the testing set. In total, there are 81 EX annotations, 81 MA annotations, 80 HE annotations, and 40 SE annotations. In this paper, these images undergo cropping, zero-padding, and resizing to 640 × 640. Considering the limited number of images in this dataset, data augmentation methods such as horizontal flipping, vertical flipping, random rotation, and contrast limited adaptive histogram equalization (CLAHE) are applied to prevent overfitting. Throughout the experiments, these images are randomly cropped into patches of size 256 × 256 to better extract local information. The DDR dataset [21] contains 757 Chinese DR images with sizes from 1380 × 1382 to 2736 × 1824, with 383 images for training, 225 for testing, and 149 for validation. These fundus images are provided with pixel-level annotations for EX, HE, MA, and SE wherever an image contains that type of lesion. In total, there are 486 EX annotations, 570 MA annotations, 601 HE annotations, and 239 SE annotations. In this paper, all images are cropped and resized to 512 × 512, and CLAHE and random cropping into patches of size 112 × 112 are performed to augment the data.
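The flip and rotation augmentations can be sketched as below; CLAHE is omitted for brevity, and restricting "random rotation" to 90-degree steps on square patches is an illustrative simplification, not the paper's exact recipe:

```python
import numpy as np

def augment(image, rng=None):
    """Apply random horizontal/vertical flips and a random 90-degree rotation.

    Illustrative sketch for square patches; each flip is applied with
    probability 0.5, and the rotation count is drawn from {0, 1, 2, 3}.
    """
    rng = rng or np.random.default_rng(0)
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]            # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :]            # vertical flip
    out = np.rot90(out, k=int(rng.integers(4)))  # square input assumed
    return out
```

These transforms only permute pixels, so lesion masks can be augmented identically without altering annotation statistics.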

Table 1. Comparative segmentation of EX, HE, MA, and SE on IDRiD.

Table 2. Comparative segmentation of EX, HE, MA, and SE on DDR.

Table 3. Comparative segmentation of EX and HE on ultra-wide-field fundus images.

Table 4. Comparison of the different lightweight model parameters.

Table 5. Performance comparison of different components of our network on IDRiD.

Table 6. Performance comparison of different components of our network on DDR.

Table 7. Performance comparison of different components of our network on ultra-wide-field fundus images.