1. Introduction
Diabetic Retinopathy (DR) is one of the most common retinal vascular complications of diabetes mellitus, in which high blood sugar causes insidious and continuous damage to the retinal blood vessels [1]. DR can eventually lead to vision impairment and even blindness. However, DR has no obvious symptoms in its early stage and is not easily diagnosed; once it progresses to an advanced stage, the condition becomes irreversible [2]. Many patients have already missed the best window for treatment by the time they notice their impaired vision. In developing countries, diabetic retinopathy has become the second leading cause of low vision or blindness in the working-age adult population, after hereditary eye disease [3].
Ophthalmologists usually look for DR lesions in fundus images to diagnose the condition. In clinical practice, there are four types of abnormal lesions in DR: Microaneurysms (MA), Hemorrhages (HE), Soft Exudates (SE), and Hard Exudates (EX) [4]. Doctors can assess the type, size, distribution, and number of these four lesions via manual segmentation to determine the disease’s progression [5] (see Figure 1). However, high cost and limited physician resources have prevented manually segmented DR screening from becoming widespread on a large scale [6]. With the development of technology, automatic lesion segmentation has gradually matured and been applied in clinical diagnosis.
Mainstream automatic DR segmentation techniques fall into two types: non-deep learning and deep learning methods [7]. The former rely on manual feature extraction, mainly morphological processing, classification, and region-growing methods [8,9,10]. However, these segmentation methods are not highly accurate, and they rely on manual feature extraction by professional doctors [11]. Therefore, it is difficult to employ them in medical diagnosis. Deep learning methods use automatic feature extraction and skip the traditional feature engineering steps [12]. This enables an end-to-end workflow and reduces the need for domain expertise. Fully Convolutional Networks (FCN) were the first deep learning method applied for this purpose [13]. Subsequently, the emergence of U-Net pushed deep learning methods to their pinnacle [14].
In particular, Eftekhari et al. [15] proposed a convolutional neural network-based method for MA single-lesion segmentation. They used a two-stage training strategy to first select lesion candidates and then classified MA and non-MA pixels with Convolutional Neural Networks (CNN). However, their approach was time-consuming and labor-intensive, and the segmentation accuracy was not adequately high. Mo et al. [16] designed a Fully Convolutional Residual Network (FCRN) incorporating multi-level information for EX lesion segmentation. The FCRN could quickly and accurately segment lesions, avoiding extensive pre- or post-processing steps. However, it is only effective for EX lesions and not for the other three lesion types. Li et al. [17] released the DDR dataset for multi-lesion segmentation and attempted multi-lesion segmentation tasks using PSPNet [18] and DeeplabV3+ [19]. Guo et al. [20] developed an FCN-based end-to-end multi-lesion segmentation network called L-seg, introduced a weighted fusion module, and achieved promising results.
Although neural network-based DR segmentation methods have achieved certain results, DR segmentation remains a challenging task for the following reasons. First, the four lesion types in DR show high inter-class similarity and large intra-class variation; moreover, the lesions are extremely small and vary greatly in shape, which makes effective feature extraction difficult. Second, the retinal structure is complex, containing anatomical structures such as blood vessels and the optic disc that are easily confused with lesions, and the retina is much larger than the DR lesions themselves. Segmenting small lesions within such complex structures in retinal images is a great challenge. Third, medical images often involve patient privacy and costly manual annotation, resulting in a severe shortage of available public datasets, and the quality of the available datasets is not ideal for adequate model training. Overall, it is necessary to further investigate the features of the four DR lesions and propose a DR multi-lesion segmentation network with wider applicability.
In this paper, we propose a Multi-Scale and Multi-Level Feature fusion network (MSLF-Net) to address the problem of DR segmentation. The proposed MSLF-Net contains a Multi-Scale Feature Extraction (MSFE) module and a Multi-Level Feature Fusion (MLFF) module. The former builds pyramidal feature extraction blocks that extract multi-scale information and provide more comprehensive features for the segmentation network. The latter facilitates multi-level feature learning within the same category through a cross-layer structure that allows lower-level features of a category to participate in and enhance the semantic classification of its higher-level features. With these improvements, MSLF-Net effectively improves segmentation accuracy for DR and outperforms other comparable end-to-end models. This paper’s innovations and research contributions are summarized as follows:
- 1.
This paper established effective pyramid feature extraction blocks, which generate multi-scale features efficiently and significantly enhance the segmentation of small lesions.
- 2.
This paper proposed a cross-layer, multi-scale class-response feature fusion method that makes multi-scale feature fusion more effective and adequately meets the challenges of small or complex lesion features. The method avoids complicated preprocessing, yields a high-performance end-to-end segmentation network, and reaches an advanced level in DR lesion segmentation.
- 3.
This paper improved image quality by preprocessing field of view (FOV) mask images for the IDRID and e_ophtha datasets, so that the model can be trained and operated on small datasets, and to facilitate future work on DR lesion segmentation.
The rest of this article is organized as follows. Section 3 elaborates the data and methodological concepts employed in this work. Section 4 records and describes the experimental results. Then, Section 5 discusses the experimental results. Finally, Section 6 concludes the paper.
3. Materials and Methods
In this section, the datasets and data processing methods are first introduced. Then, we give a brief overview of our proposed model and elaborate on its components. Finally, the methods for evaluating model performance are described.
3.1. Datasets
The proposed MSLF-Net is evaluated on two public datasets: IDRID [28] and e_ophtha [29]. The IDRID dataset comes from the 2018 Diabetic Retinopathy Segmentation and Grading Challenge and is the only dataset annotated with all four kinds of DR lesions. IDRID consists of 81 color fundus images with lesion annotations at a resolution of 4288 × 2848 pixels. In the experiments, the data were divided into a training set of 54 images and a testing set of 27 images. e_ophtha provides only two types of lesion annotations: EX lesions and MA lesions. Its images have various resolutions, such as 2544 × 1696 and 1440 × 960. e_ophtha_EX consists of 47 color fundus images, of which we used 32 for training and 15 for testing. e_ophtha_MA has 148 images; the training and testing sets contain 20 and 8 images, respectively.
3.2. Image Preprocessing and Augmentation
Image preprocessing and augmentation aim to resize the dataset images and increase the image count. The images of the two datasets used in this paper have high resolutions and varying pixel dimensions, which makes them difficult to use directly as model input. Accordingly, we resized the images to 1440 × 960 pixels without deformation. To provide more training data, the images were additionally augmented. We performed five random transformations, each applied with a 50% probability: random vertical flip, random horizontal flip, random affine, random rotation, and random adjustment of brightness, contrast, saturation, and hue within a range of 0.07.
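A minimal sketch of this augmentation pipeline with torchvision is given below. Geometric transforms must be applied identically to the image and its label mask; the rotation and affine parameter ranges are illustrative assumptions, as the text does not specify them.

```python
# Sketch of the described augmentation; rotation/affine ranges are assumptions.
import random
import torchvision.transforms.functional as TF
from torchvision import transforms

def augment(image, mask):
    """Apply each random transformation with 50% probability.
    Geometric transforms are applied identically to image and mask."""
    if random.random() < 0.5:
        image, mask = TF.vflip(image), TF.vflip(mask)
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    if random.random() < 0.5:
        angle = random.uniform(-15, 15)  # rotation range assumed
        image, mask = TF.rotate(image, angle), TF.rotate(mask, angle)
    if random.random() < 0.5:
        # Random affine with a small translation (parameters assumed)
        params = transforms.RandomAffine.get_params(
            degrees=(0, 0), translate=(0.05, 0.05), scale_ranges=None,
            shears=None, img_size=TF.get_image_size(image))
        image, mask = TF.affine(image, *params), TF.affine(mask, *params)
    if random.random() < 0.5:
        # Photometric jitter in the stated 0.07 range, image only
        jitter = transforms.ColorJitter(brightness=0.07, contrast=0.07,
                                        saturation=0.07, hue=0.07)
        image = jitter(image)
    return image, mask
```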
In addition, only the circular eyeball area of a fundus image contains valid information; the rest is filled with black pixels. To highlight the circular eyeball area and prevent the irrelevant black pixels from affecting network training, we followed fundus retinal datasets such as DRIVE [30] and created mask images for each image of IDRID and e_ophtha. An IDRID image with its lesion annotation and mask is shown in Figure 2.
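The mask-generation procedure is not detailed in the text; one common approach, sketched here as an assumption, is to threshold the dark background with OpenCV and clean the result morphologically (the threshold value and kernel size are illustrative).

```python
# Possible FOV mask generation by thresholding; parameters are assumptions.
import cv2

def make_fov_mask(image_path, threshold=10):
    img = cv2.imread(image_path)                      # BGR fundus image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    # Morphological closing removes small holes inside the eyeball region
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```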
3.3. Model Architecture
3.3.1. Model Overview
The structure of MSLF-Net was inspired by U-Net and adopts a symmetric encoder-decoder structure. The encoder uses a pre-trained Vgg16 [31] to enhance feature learning capability, while the decoder uses transposed-convolution up-sampling. This structure allows for effective feature integration. The model adds the MSFE and MLFF modules to U-Net: the former fully extracts multi-scale information to enhance the model’s feature extraction capability, and the latter enhances segmentation accuracy through a cross-layer architecture that allows both lower- and higher-level features to participate in semantic classification. These two modules complement each other to achieve multi-scale feature fusion on feature maps of the same activation class. This gives the model stronger discriminative ability for lesions with variable scales and complex features, and finer segmentation capability. Additionally, we adopted a deep supervision strategy [32], retaining the original output layer in the U-Net decoder to suppress background pixels and provide detailed lesion localization and structural information to the model. The overall structure of MSLF-Net is shown in Figure 3.
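As a rough illustration, the following is a much-simplified sketch of such an encoder-decoder skeleton with a Vgg16 encoder and transposed-convolution up-sampling; the skip connections, the MSFE and MLFF modules (covered in the next subsections), and the deep-supervision branch are omitted, and the decoder widths are assumptions.

```python
# Simplified encoder-decoder skeleton; decoder channel widths are assumptions.
import torch.nn as nn
from torchvision.models import vgg16

class Skeleton(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.encoder = vgg16(pretrained=True).features   # 1/32 scale, 512 ch
        self.decoder = nn.Sequential(
            *[nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 2, stride=2),
                            nn.Conv2d(c_out, c_out, 3, padding=1),
                            nn.ReLU(inplace=True))
              for c_in, c_out in [(512, 512), (512, 256), (256, 128),
                                  (128, 64), (64, 64)]])
        self.head = nn.Conv2d(64, num_classes, 1)        # per-class output

    def forward(self, x):
        return self.head(self.decoder(self.encoder(x)))
```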
3.3.2. Multi-Scale Feature Extraction Module (MSFE)
The MSFE module’s purpose is to leverage both the physical contour information from the network’s shallow layers and the semantic category information from its deep layers. In general, the shallow layers capture low-level information such as points, lines, edges, and textures; they also have a small down-sampling factor and high resolution, which is suitable for segmenting small lesions. The deep layers do the opposite, extracting semantic category information. Since the fundus images of DR contain lesions of different sizes, predictions from a single-resolution feature map are ineffective. We borrowed the FPN [33] idea and designed a pyramidal multi-scale feature extraction module to collect feature map information at different resolutions. To retain as much of the extracted multi-scale feature information as possible, the MSFE module was designed with activation and convolution operations based on the idea of ResNet [34], and the “add” method was used to fuse the feature maps before and after the convolution operations.
The structure is shown in Figure 4. We pass each of the decoder module’s five convolution blocks through the MSFE module to obtain five new convolution blocks. The first four must be up-sampled to the original image size via bilinear interpolation; the last convolution block needs no up-sampling because its size is already identical to the original.
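A minimal sketch of one such pyramid block is given below, assuming channel compression to the class count and the residual “add” fusion described above; the exact layer configuration is an assumption.

```python
# Sketch of one MSFE pyramid block; layer configuration is an assumption.
import torch.nn as nn
import torch.nn.functional as F

class MSFEBlock(nn.Module):
    def __init__(self, in_channels, out_channels=5):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, x, target_size):
        y = self.relu(self.conv1(x))   # compress channels to the class count
        y = y + self.conv2(y)          # "add" fusion of pre/post-conv features
        # Up-sample to the original image size via bilinear interpolation
        return F.interpolate(y, size=target_size, mode="bilinear",
                             align_corners=False)
```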
3.3.3. Multi-Level Feature Fusion Module (MLFF)
The MLFF module improves feature concatenation through feature channel rearrangement. It was designed to avoid naively fusing the multi-scale features extracted by the MSFE module directly; instead, it fully utilizes the multi-scale information while ensuring that same-class activation features are fused together.
In convolutional neural networks, there are two common feature fusion methods. One adds the feature maps element by element, as in ResNet, SSD [35], etc. The other concatenates the feature maps along the channel dimension and then adjusts the number of channels with convolution, as in DenseNet [36], U-Net, etc.; both are shown in Figure 5a,b.
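In PyTorch terms, the two operations look as follows (shapes chosen for illustration):

```python
# Element-wise addition (ResNet/SSD style) vs. channel-wise concatenation
# (DenseNet/U-Net style).
import torch

a = torch.randn(1, 64, 32, 32)
b = torch.randn(1, 64, 32, 32)
added = a + b                          # shape stays (1, 64, 32, 32)
concatenated = torch.cat([a, b], 1)    # channels double: (1, 128, 32, 32)
```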
These two fusion methods are only suitable for shallow feature fusion, not for deep features that already carry category attributes. In image segmentation tasks, the label map is one-hot encoded, so category information is mapped sequentially to different channels, and the channels of the supervised output layer are likewise activated by category in order. If these feature maps are directly summed or spliced, the activated channels’ category information is destroyed. To protect each channel’s category information, the MLFF module performs slicing and splicing across the convolutional layers instead of the regular splicing followed by convolution. Through the output layer, we implicitly constrain each channel of these multi-scale feature maps to carry the information most relevant to its corresponding category, guiding the feature maps extracted by the MSFE module to converge to a category identity.
As shown in Figure 5c, images of the same color represent feature maps containing information of the same semantic category. The MLFF module classifies the channels of the convolution blocks by semantic information. The number of channels in each convolution block equals the number of categories, and each channel corresponds to the activation of one class. In the feature fusion operation, the MLFF module stitches together the same-category channels of different convolution blocks using the “Concatenate” connection to highlight information of different scales within the same category. The specific operations are as follows:
As shown in Figure 3, the MSFE module generates five convolutional blocks, each with $c$ channels:

$$F_i = [f_i^1, f_i^2, \ldots, f_i^c], \quad i = 1, 2, \ldots, 5.$$

Then, MLFF splices the channels of the same category through a slice operation across convolutional blocks to generate $G_j$:

$$G_j = \mathrm{Concat}(f_1^j, f_2^j, \ldots, f_5^j), \quad j = 1, 2, \ldots, c.$$

Finally, a $c$-group $1 \times 1$ convolution is performed on $[G_1, G_2, \ldots, G_c]$ to obtain the output layer shown in Figure 5c:

$$O = \mathrm{Concat}\big(\mathrm{Conv}_{1 \times 1}(G_1), \mathrm{Conv}_{1 \times 1}(G_2), \ldots, \mathrm{Conv}_{1 \times 1}(G_c)\big).$$
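Below is a minimal PyTorch sketch of this slice-and-splice fusion, assuming five input blocks whose channel count equals the number of classes; the module and variable names are illustrative, not from the paper.

```python
# Sketch of the MLFF channel rearrangement; names are illustrative.
import torch
import torch.nn as nn

class MLFF(nn.Module):
    def __init__(self, num_blocks=5, num_classes=5):
        super().__init__()
        self.num_classes = num_classes
        # One 1x1 convolution per class, fusing the same-class channels
        self.fuse = nn.ModuleList(
            [nn.Conv2d(num_blocks, 1, kernel_size=1)
             for _ in range(num_classes)])

    def forward(self, blocks):
        # blocks: list of five tensors, each of shape (B, num_classes, H, W)
        out = []
        for j in range(self.num_classes):
            # Slice channel j from every block, concatenate across blocks
            same_class = torch.cat([b[:, j:j + 1] for b in blocks], dim=1)
            out.append(self.fuse[j](same_class))   # (B, 1, H, W)
        return torch.cat(out, dim=1)               # (B, num_classes, H, W)
```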
3.4. Implementation Details and Experiment Settings
In MSLF-Net, the encoder consists of five convolutional blocks, while the decoder is composed of one transition block and five up-sampling blocks. The five convolution blocks in the encoder have the same structure as Vgg16 and are initialized with its pre-trained weights from the ImageNet dataset [37]. The structure of the decoder is the same as that of U-Net. The MSFE module compresses the channels of the decoder’s convolutional blocks to five, while the MLFF module rearranges and compresses the MSFE-processed convolutional blocks according to channel order to obtain the output. The detailed configuration of MSLF-Net is shown in Table 1.
The loss function is a hybrid loss combining the weighted cross-entropy and Dice loss functions. The weighted cross-entropy loss function is:

$$L_{wce} = -\sum_{c=1}^{M} w_c \, y_c \log(p_c),$$

where $M$ ($M = 5$) is the number of classes and $c$ is the class index; $w_c$ stands for the class weight, set to 100, 100, 50, 30, and 1 for MA, HE, SE, EX, and background, respectively, based on the ratio of the number of labeled pixels to the number of all pixels in the dataset; $y$ denotes the label value, and $p$ indicates the predicted probability.
The Dice loss function formula is:

$$L_{dice} = 1 - \frac{1}{M}\sum_{c=1}^{M} \frac{2\,TP_c}{2\,TP_c + FN_c + FP_c},$$

where $TP_c$, $FN_c$, and $FP_c$ are the predicted probabilities of true positives, false negatives, and false positives for class $c$, respectively; $c$ and $M$ have the same meanings as in the formula above.
The hybrid loss function is:

$$L = L_{wce} + \lambda \, L_{dice},$$

where $\lambda$ is the weight ratio of the two loss functions and is set to 1.0. The influence of $\lambda$ on the segmentation performance of the model is discussed in the ablation experiments.
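A sketch of this hybrid loss in PyTorch is shown below, using the stated class weights and $\lambda = 1.0$; the soft TP/FN/FP computation from softmax probabilities is one common implementation and an assumption here.

```python
# Sketch of the hybrid loss; soft TP/FN/FP from softmax is an assumption.
import torch
import torch.nn as nn

class HybridLoss(nn.Module):
    def __init__(self, lam=1.0, eps=1e-6):
        super().__init__()
        # Class weights for MA, HE, SE, EX, background (move to the logits'
        # device in practice)
        weights = torch.tensor([100.0, 100.0, 50.0, 30.0, 1.0])
        self.ce = nn.CrossEntropyLoss(weight=weights)
        self.lam, self.eps = lam, eps

    def forward(self, logits, target):
        # logits: (B, 5, H, W); target: (B, H, W) with class indices 0..4
        ce_loss = self.ce(logits, target)
        probs = torch.softmax(logits, dim=1)
        onehot = nn.functional.one_hot(target, 5).permute(0, 3, 1, 2).float()
        tp = (probs * onehot).sum(dim=(0, 2, 3))
        fp = (probs * (1 - onehot)).sum(dim=(0, 2, 3))
        fn = ((1 - probs) * onehot).sum(dim=(0, 2, 3))
        dice = (2 * tp + self.eps) / (2 * tp + fn + fp + self.eps)
        return ce_loss + self.lam * (1 - dice.mean())
```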
We implemented the experimental architecture in PyTorch and ran it on an NVIDIA RTX 3090 GPU. In the experiments, we trained the models for 200 epochs using the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0005, with the batch size set to 2 and the initial learning rate to 0.01. In addition, we adopted the Poly strategy to dynamically adjust the learning rate during training:

$$lr = lr_{init} \times \left(1 - \frac{epoch}{epoch_{max}}\right)^{power}.$$

The epoch-wise learning rate curve and loss curve during training on the IDRID dataset are presented in Figure 6.
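The settings above can be assembled as in the following sketch; the power exponent of 0.9 is a common default for the Poly strategy and is an assumption, as the text does not state it.

```python
# Training settings from the text; power=0.9 is an assumption.
import torch

model = torch.nn.Conv2d(3, 5, kernel_size=3)   # placeholder for MSLF-Net
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
max_epoch, power = 200, 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: (1 - e / max_epoch) ** power)

for epoch in range(max_epoch):
    # ... one training epoch over the data loader would run here ...
    scheduler.step()
```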
3.5. Model Performance Evaluation
The number of background pixels in DR images is much larger than the number of lesion pixels. Under such high class imbalance, the Precision-Recall (PR) curve focuses on true-positive predictions and is therefore generally considered more informative than other indicators. To compare the segmentation performance of different models quantitatively, the AUPR, i.e., the area under the PR curve, is also introduced; the closer the AUPR value is to 1, the better the model’s predictive performance. In addition, we report the mean over 5 runs as the final experimental result to evaluate the overall model performance for multi-lesion segmentation of DR.
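For illustration, per-lesion AUPR can be computed with scikit-learn as sketched below, assuming flattened per-pixel probabilities and binary ground-truth masks for each lesion class.

```python
# Per-class AUPR from flattened pixel predictions, using scikit-learn.
from sklearn.metrics import precision_recall_curve, auc

def aupr(y_true, y_score):
    """y_true: flattened binary mask; y_score: flattened predicted
    probabilities for one lesion class."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)

# Example usage (hypothetical arrays):
# value = aupr(gt_mask.ravel(), prob_map.ravel())
```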