1. Introduction
Recently, retinal image segmentation technology has been used in the diagnosis of cardiovascular and ophthalmic diseases [1], among which cataracts, glaucoma, and diabetic retinopathy (DR) are the main diseases that cause blindness [2]. Accurate segmentation of retinal images is an important prerequisite for doctors to perform professional diagnosis and prediction of related diseases. In recent years, many researchers have used computers to automatically segment medical images. Existing methods for segmenting retinal vessels fall mainly into two categories: unsupervised methods and supervised methods [1].
Unsupervised methods learn from the inherent properties of blood vessels and therefore do not need manually annotated reference samples. In [3], a method combining vessel center-line detection and morphological plane slicing was proposed to extract vessel trees from retinal images. In [4], the B-COSFIRE filter was used to segment vessels automatically. In [5], a multi-scale vessel enhancement method based on the complex continuous wavelet transform (CCWT) was proposed, which segments vessels using histogram-based adaptive thresholding and an appropriate length-filtering method. In [6], line detectors at different scales were realized by varying the length of a basic line detector, and their responses were combined to produce the final segmentation for each retinal image. The work in [7] proposed an entropy filtering algorithm based on curvature evaluation and texture mapping to segment color fundus vessel images. In [8], an improved multi-scale line detector combined the responses at different scales, with a different weight assigned to each scale, to perform vessel segmentation. In [9], a hybrid method for retinal vessel segmentation was proposed; it uses normalized Gabor filtering to reduce background brightness variation and alleviate the false positives caused by lighting changes. In [10], mathematical morphology was used to smooth the retinal image, and the k-means clustering algorithm was then applied to segment the enhanced image. The work in [11] used a region-growing method with automatic seed-point selection to extract vessels from the image background. In [12], gray-level conversion based on principal component analysis (PCA) and contrast-limited adaptive histogram equalization (CLAHE) were used for retinal image enhancement, and a new matched filter was designed to segment the vessels.
Differently from unsupervised methods, supervised methods build models by referring to manually annotated samples. Supervised methods can be further divided into two categories: shallow learning methods and deep learning methods. (1) Methods based on shallow learning use handcrafted features for segmentation. In [13], a discriminatively trained, fully connected conditional random field model was used to segment retinal vessels. In [14], based on multi-scale enhancement filtering and Gabor filtering, vessels were extracted at multiple scales and in multiple directions. In [15], an AdaBoost classifier was constructed through iterative training for retinal vessel segmentation. The work in [16] proposed a new supervised retinal vessel segmentation method in which a group of robust features from different algorithms was merged into a hybrid feature vector for pixel-level description, and the hybrid feature vector was then used to train a random forest classifier. (2) Compared with methods based on shallow learning, methods based on deep learning train on a large number of data samples to extract features for segmentation. In [17], a vessel segmentation method based on a neural network was proposed. In [18], the samples were augmented by global contrast normalization, zero-phase whitening, geometric transformations, and gamma correction, and a deep neural network was used to segment the vessels. In [19], a neural network based on the convolutional neural network structure was developed. In [20], a deep convolutional neural network was used to automatically segment the vessels in OCT-A images, classifying each pixel as vessel or non-vessel. In [21], a convolutional neural network that simultaneously segments the optic disc, fovea, and blood vessels was developed and trained. The work in [22] proposed a transfer learning supervision method based on a pre-trained fully convolutional network. In [23], better preprocessing techniques and multi-layer/multi-scale deep supervision layers were used to segment retinal vessels appropriately. In [24], a deep convolutional neural network (CNN) for retinal vessel segmentation was proposed; this method improved the segmentation quality of tiny vessels. In [25], the U-Net network was improved with densely connected convolutional blocks and a new attention gate model for automatic retinal vessel segmentation. In [26], an attention-guided network (AG-Net) was proposed to preserve structural information, eliminate noise, and segment vessels, optic discs, and optic cups. In [27], a residual connection was added to the convolutional neural network, yielding a deep learning architecture for fully automatic vessel segmentation. Deep learning has also been applied in other areas. In [28], a deep genetic cascade ensemble based on different support vector machine (SVM) classifiers was constructed for credit scoring. In [29], a deep genetic hierarchical network for credit scoring was proposed. In [30], two end-to-end deep neural network (DNN) models for ECG-based authentication were proposed: a convolutional neural network (CNN) and a residual convolutional neural network with an attention mechanism, called ResNet-Attention. In [31], a novel ResNet-based signal recognition method was presented.
Among these methods, the traditional ones require prior knowledge and additional preprocessing to extract handcrafted features, cannot obtain deeper feature information, and are susceptible to low-quality images and pathological areas. The deep learning methods proposed so far usually exhibit the following problems: (1) Deep learning methods often require a large number of samples for learning, while few fundus images annotated by doctors are available. (2) A single input size prevents the model from capturing multi-scale image features, leading to low robustness. (3) When convolutional neural networks learn fundus image features, the convolution and pooling operations also filter out some useful information. This results in the loss of feature information for a large number of tiny vessels in the retinal image, which cannot be restored during up-sampling. (4) The model's up-sampling struggles to restore low-level detailed feature information, resulting in lower accuracy and more background noise.
To solve the above problems, this paper proposes a retinal vessel segmentation model based on a convolutional neural network. Our main contributions are as follows:
We propose a multi-scale residual attention network (MRA-UNet) model to automatically segment retinal vessels. We add multi-scale inputs to the network and use a residual attention module in the down-sampling part to improve the feature extraction ability of the network. This improves the robustness of the model and reduces the excessive loss of micro-vascular feature information (an illustrative sketch of such a block is given after this list).
In MRA-UNet, we propose a bottom reconstruction module, which combines the outputs of the residual attention modules in the down-sampling path and aggregates them to further enrich the contextual semantic information. It eases the problem of information loss in the model's down-sampling process.
A spatial activation module is added to the output part of the up-sampling path. While restoring the image, this module further activates the small blood vessels in the fundus image and effectively highlights vessel endings and the boundaries of small vessels.
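The exact design of these modules is given in Section 2. As a rough illustration only, the following PyTorch sketch shows a residual attention block of the general kind described: a residual convolutional branch re-weighted by a sigmoid attention map. All layer choices here are assumptions for illustration, not the authors' verified implementation.

```python
import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    """Illustrative residual attention block: convolutional features
    plus a residual connection, re-weighted by a sigmoid attention map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the residual matches the output channels.
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # Attention map in [0, 1], one weight per spatial location.
        self.attention = nn.Sequential(
            nn.Conv2d(out_ch, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        feat = self.relu(self.body(x) + self.skip(x))  # residual connection
        return feat * self.attention(feat)             # attention re-weighting
```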
The remainder of this paper is organized as follows. Section 2 describes the proposed method in detail, including the network backbone structure, residual attention module, bottom reconstruction module, and spatial activation module. Section 3 introduces the image datasets, experimental parameters, and evaluation indicators. In Section 4, we discuss and compare our experimental results. Finally, a conclusion is drawn in Section 5.
4. Experiment Results and Analysis
4.1. Comparison of Results before and after Model Improvement
In order to validate the effectiveness of the residual attention module (RA), bottom reconstruction module (BR), and spatial activation module (SA) proposed in this paper, experiments were carried out on the DRIVE and CHASE datasets. MUNet denotes U-Net with multi-scale inputs. MUNet+RA denotes the multi-scale-input U-Net with the RA module added; MUNet+SA denotes the same with the SA module added; MUNet+RA+BR adds both the RA and BR modules; and MUNet+RA+BR+SA adds all three modules (RA, BR, and SA).
The experimental results of the five models on the DRIVE dataset are shown in Table 1, where bold numbers mark the highest value of each indicator. To validate the effectiveness of the residual attention module (RA) and the spatial activation module (SA), we compared the network's performance with and without each module. The results in Table 1 show that the models with the residual attention module or the spatial activation module added had higher accuracy and F-measure than the baseline model MUNet. Introducing the RA and SA modules enables the network to learn features better: RA lets the network focus on feature extraction during down-sampling, alleviating the information loss caused by convolution and pooling operations, while the SA module effectively promotes the recovery of small edge vessels during up-sampling.
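For concreteness, a plausible spatial activation gate along these lines is sketched below; the exact formulation belongs to Section 2, so every layer choice here is an assumption rather than the paper's actual module.

```python
import torch
import torch.nn as nn

class SpatialActivation(nn.Module):
    """Illustrative spatial activation gate: emphasizes thin-vessel
    responses by multiplying features with a per-pixel sigmoid map."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Residual gating keeps the original signal while boosting
        # spatial locations the gate considers vessel-like.
        return x * (1.0 + self.gate(x))
```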
Adding the residual attention module and the bottom reconstruction module to the MUNet model (MUNet+RA+BR) yielded substantial improvements in accuracy and F-measure over the baseline MUNet; in particular, the F-measure was 2.1% higher. After adding all three modules to MUNet, the accuracy decreased slightly, but the sensitivity, F-measure, and AUC increased by 4.46%, 0.43%, and 0.04%, respectively, compared with MUNet+RA+BR.
Table 2 shows the experimental results of the five models on the CHASE dataset. As in the DRIVE experiment, we added the RA and SA modules to MUNet separately to validate their effectiveness. From the results in Table 2, we can see that all indicators of the model improved after adding the RA or SA module, especially the F-measure. To validate the effectiveness of the BR module, we added RA and BR to MUNet; compared with the MUNet+RA model, the accuracy and F-measure increased by 0.11% and 1.43%, respectively. Finally, we added all three modules to MUNet. The results show that although MUNet+BR+RA+SA had lower specificity than MUNet+RA, it achieved the highest accuracy, sensitivity, F-measure, and AUC among the five models.
To further confirm the improvements, we tested the hypothesis that the baseline model performs better when the RA, SA, and BR modules are added. We conducted a p-value analysis on accuracy for the DRIVE database, with the results shown in Table 3. MUNet:MUNet+RA denotes the comparison between the MUNet and MUNet+RA models, and MUNet:MUNet+SA denotes the comparison between the MUNet and MUNet+SA models. In the table, MUNet:MUNet+RA has a p-value < 0.05 for accuracy, indicating that the difference between the MUNet+RA and MUNet models is statistically significant; that is, the MUNet+RA model performs better than the MUNet model. Likewise, MUNet:MUNet+BR+RA+SA has a p-value < 0.05 for accuracy, indicating that the MUNet+BR+RA+SA model performs better at retinal vessel segmentation than MUNet. The p-value analysis on the CHASE dataset is shown in Table 4: MUNet:MUNet+BR+RA+SA has a p-value < 0.05 for accuracy, indicating that the MUNet+BR+RA+SA model also performs better than MUNet on the CHASE database.
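The section does not name the specific statistical test used; a common choice for paired per-image accuracies is the paired t-test, sketched below with SciPy. The accuracy arrays are placeholder values, not the paper's measurements.

```python
from scipy import stats

# Per-image test accuracies for the two models (placeholder values;
# in practice these would come from the DRIVE test images).
acc_munet    = [0.9651, 0.9688, 0.9702, 0.9645, 0.9670]
acc_munet_ra = [0.9683, 0.9715, 0.9724, 0.9679, 0.9701]

# Paired test: the same images are segmented by both models.
t_stat, p_value = stats.ttest_rel(acc_munet_ra, acc_munet)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant at the 5% level.")
```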
4.2. Model Parameter Quantity and Computation Time Analysis
The time complexity calculation formula of a convolutional neural network [37] is as follows:

$$O\left(\sum_{l=1}^{d} n_{l-1} \cdot s_l^2 \cdot n_l \cdot m_l^2\right)$$

Here, $l$ is the index of a convolutional layer and $d$ is the depth (number of convolutional layers); $n_l$ is the number of filters in the $l$-th layer, so $n_{l-1}$ is the number of input channels of the $l$-th layer; $s_l$ is the spatial size (side length) of the filter; and $m_l$ is the spatial size of the output feature map.
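As a worked illustration, the formula can be evaluated directly from a list of layer specifications; the layer sizes below are hypothetical and are not those of MRA-UNet.

```python
# Worked illustration of the time-complexity formula above.
# Each tuple: (n_prev input channels, s filter size, n filters, m output size).
# These layer specs are hypothetical, not MRA-UNet's actual configuration.
layers = [
    (1,  3, 32, 48),   # e.g., a 48x48 gray patch -> 32 feature maps
    (32, 3, 64, 24),
    (64, 3, 128, 12),
]

total_ops = sum(n_prev * s**2 * n * m**2 for n_prev, s, n, m in layers)
print(f"~{total_ops:,} multiply-accumulate operations per forward pass")
```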
To evaluate the size of our model more accurately, we used the summary function in Python to calculate the model's parameter count, for two reasons: first, manually calculating the time complexity is error-prone; second, many researchers use parameter counts to evaluate model size. The total size of the parameters of the model proposed in this paper is 223.54 MB.
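The text does not name the exact package; one common way to obtain such a count in PyTorch, assuming the widely used torchsummary package for the per-layer table, is sketched below with a stand-in model.

```python
import torch
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters in a PyTorch model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = nn.Conv2d(1, 32, kernel_size=3)   # stand-in model for the demo
n_params = count_parameters(model)
size_mb = n_params * 4 / (1024 ** 2)      # float32: 4 bytes per parameter
print(f"{n_params:,} parameters (~{size_mb:.4f} MB)")

# The torchsummary package prints a per-layer breakdown instead:
# from torchsummary import summary
# summary(model, input_size=(1, 48, 48))  # (channels, H, W)
```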
Our model takes 0.8633 s to segment a complete image on the DRIVE dataset, 0.9649 s on the CHASE dataset, and 1.18 s on the STARE dataset. Comparisons between the computation time of the proposed model and those of other models are shown in Table 5, Table 6, and Table 7. Our model is the fastest at segmenting fundus images on the DRIVE and CHASE datasets, though not on STARE. This makes our model suitable for real-world clinical applications.
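The timing protocol is not described in detail; a typical way to measure per-image inference time in PyTorch, sketched under that assumption, is:

```python
import time
import torch

def time_inference(model, image, device="cuda"):
    """Measure wall-clock time for one full-image forward pass."""
    model = model.to(device).eval()
    image = image.to(device)
    with torch.no_grad():
        model(image)                       # warm-up pass
        if device == "cuda":
            torch.cuda.synchronize()       # wait for queued GPU work
        start = time.perf_counter()
        model(image)
        if device == "cuda":
            torch.cuda.synchronize()
        return time.perf_counter() - start
```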
4.3. Evaluation of ROC and Precision Recall (PR) Curves before and after Model Improvement
In Figure 5 and Figure 6, we compare the ROC and PR curves of different models on the DRIVE and CHASE datasets. Figure 5 shows that, after the RA module was added to the baseline model, the areas under the ROC and PR curves of the model (MUNet+RA) increased by 0.0136 and 0.0118, respectively. This is due to the attention mechanism and skip connections in the RA module, which further extract semantic information during down-sampling. With the addition of the SA and BR modules, the areas under the model's ROC and PR curves increased further, which also demonstrates the effectiveness of the three modules RA, SA, and BR.
On the CHASE database, the areas under the ROC and PR curves of the best model, MUNet+RA+SA+BR, were 0.9899 and 0.8326, increases of 0.0089 and 0.0531 over the baseline model MUNet. With the addition of the three modules RA, SA, and BR, the baseline model performs progressively better; the model containing all three modules achieves the best results on the CHASE database.
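For reference, both curve areas can be computed from per-pixel probabilities with scikit-learn, as in this sketch; the random arrays stand in for a real ground-truth mask and prediction map.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# y_true: flattened binary vessel mask; y_prob: per-pixel probabilities.
# Random placeholders stand in for a real mask/prediction pair here.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)
y_prob = y_true * 0.6 + rng.random(10_000) * 0.4  # crude correlated scores

print("Area under ROC curve:", roc_auc_score(y_true, y_prob))
print("Area under PR curve: ", average_precision_score(y_true, y_prob))
```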
4.4. Visualization Results with Different Methods
We compared the method proposed in this paper with the method proposed by Aslani [16]. Figure 7 and Figure 8 show the visualization results on the DRIVE and CHASE datasets, respectively. In Figure 7, column (a) shows the original image; column (b) shows the ground truth corresponding to the original image; column (c) shows the segmentation result of Aslani [16]; and column (d) shows the segmentation result of the method proposed in this paper. In Figure 8, columns (a) and (b) are the same as in Figure 7, column (c) shows the segmentation result of R2U-Net, and column (d) shows the segmentation result of the method proposed in this paper. The retinal vessels segmented by the method of Aslani [16] contain considerable noise and incorrect vessel branches, and suffer from problems such as unclear segmentation of small edge vessels and blurred boundaries. Although the retinal vessels segmented by R2U-Net contain less noise, blurred boundaries and unclear small vessels remain.
The method proposed in this paper uses the residual attention module for down-sampling, which reduces the loss of feature information caused by pooling and convolution and thereby better extracts the feature information of small edge vessels. The attention mechanism enables the network to distinguish the foreground and background regions well, so the segmentation result contains less noise and is more accurate, especially for the small vessels in the red boxes in the figures.
4.5. Comparison of Segmentation Results with Different Methods
To prove the effectiveness of our method for vessel segmentation, we evaluated it against previously proposed unsupervised and supervised methods on three datasets, mainly using sensitivity, specificity, accuracy, F-measure, and AUC as evaluation indicators. Table 5 and Table 6 give the retinal vessel segmentation results on the two standard datasets DRIVE and CHASE. Table 7 shows the segmentation results for the 20 retinal images of the STARE database, and Table 8 shows the results on the STARE database trained and tested using the leave-one-out method. From the tables, we can see that the segmentation results of supervised methods are generally better than those of unsupervised methods, and the AUC of supervised methods is higher.
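These threshold-based indicators all derive from the per-pixel confusion matrix (AUC additionally needs the soft probabilities, e.g., via scikit-learn's roc_auc_score); a compact sketch of their computation is:

```python
import numpy as np

def segmentation_metrics(y_true, y_pred):
    """Per-pixel metrics from binary masks (flattened 0/1 arrays)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    sensitivity = tp / (tp + fn)            # recall on vessel pixels
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, accuracy, f_measure
```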
For the DRIVE dataset, our method achieved good results on the evaluation indicators, except that its specificity was not as high as that of Zhang [26]. The vessels segmented by Aslani [16] contain more noise, and the segmentation of small vessel branches at the edges is blurry; neglecting the small branch vessels at the margins leads to low sensitivity and high specificity. Our method uses skip connections to reduce the information loss in the down-sampling process. Local residual learning allows the network to ignore less important information, such as background or low-frequency regions, and to bypass noisy regions, making the network attend to effective information. We also use the attention mechanism to increase the contrast between the background and the vessels, so that the network notices the small vessel branches in the fundus; our method thus recognizes small edge vessels better. Our method is 1.4%, 0.77%, and 1.91% higher in accuracy, F-measure, and AUC than AG-UNet [39], the most recently proposed method.
For the CHASE database, the uneven background illumination of the sample images makes it difficult to distinguish vessels from wider arterioles, which requires the model to have strong feature extraction capabilities. Compared with R2U-Net [38], our method is 1.24%, 1.99%, and 0.84% higher in accuracy, F-measure, and AUC, respectively. We reduced the loss of feature information by adding the residual attention module during the down-sampling process, and we used the bottom reconstruction module to further aggregate the down-sampled feature information. On this dataset, Jiang [22] had the highest sensitivity, but that method's accuracy and AUC were lower than our method's.
Table 7 shows the experimental results on the STARE dataset. The F-measure of our method's segmentation was 84.22%, which is 0.34% higher than that of residual U-Net. In terms of sensitivity, the result of Samuel [23] was higher than ours, and the specificity of Atli [27] is higher than that of the method proposed in this paper. However, our method has the highest F-measure, accuracy, and AUC.
Table 8 shows the experimental results on the STARE dataset trained and tested with the leave-one-out method. The highest accuracy, sensitivity, specificity, F-measure, and AUC are 98.62%, 93.42%, 99.47%, 88.65%, and 99.60%, respectively. The average accuracy, sensitivity, specificity, F-measure, and AUC of our method over the 20 fundus images are 97.63%, 84.22%, 98.73%, 84.22%, and 99.18%, respectively.
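For clarity, leave-one-out evaluation on STARE's 20 images means training 20 models, each holding out a different image for testing. The following schematic shows only the data split; train_model and evaluate are hypothetical helpers, not the paper's actual training code.

```python
# Schematic leave-one-out protocol for the 20 STARE images.
# train_model and evaluate are hypothetical helpers, shown only to
# illustrate the data split, not the paper's actual training code.
images = list(range(20))          # indices of the STARE fundus images
per_image_scores = []

for held_out in images:
    train_set = [i for i in images if i != held_out]
    model = train_model(train_set)                      # train on 19 images
    per_image_scores.append(evaluate(model, held_out))  # test on the 1 left out

mean_score = sum(per_image_scores) / len(per_image_scores)
```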
Our method uses skip connections to reduce the information loss caused by pooling and convolution in the network's down-sampling process, so that the network can extract more feature information and segment small edge vessels more accurately. The attention mechanism improves U-Net so that the network can better distinguish foreground from background and effectively ignore noise. The multi-scale inputs allow the network to learn feature information at different scales, and the bottom reconstruction module further aggregates the information obtained after down-sampling to capture feature information under different receptive fields. Finally, the spatial activation module used in the up-sampling process promotes the recovery of peripheral small vessels. The results on DRIVE, CHASE, and STARE show that our method has good performance and robustness for the segmentation of fundus vessels.