A Multi-Scale Attention Fusion Network for Retinal Vessel Segmentation

Abstract: The structure and function of retinal vessels play a crucial role in diagnosing and treating various ocular and systemic diseases. Therefore, the accurate segmentation of retinal vessels is of paramount importance in assisting clinical diagnosis. U-Net has been highly praised for its outstanding performance in the field of medical image segmentation. However, as network depth increases, multiple pooling operations may lead to the loss of crucial information. Additionally, the insufficient handling of local contextual features by skip connections can affect the accurate segmentation of retinal vessels. To address these problems, we propose a novel model for retinal vessel segmentation. The proposed model is built on the U-Net architecture, with two additional blocks, namely, an MsFE block and an MsAF block, inserted between the encoder and decoder at each layer of the U-Net backbone. The MsFE block extracts low-level features at different scales, while the MsAF block performs feature fusion across various scales. Finally, the output of the MsAF block replaces the skip connection in the U-Net backbone. Experimental evaluations on the DRIVE, CHASE_DB1, and STARE datasets demonstrated that MsAF-UNet exhibited excellent segmentation performance compared with state-of-the-art methods.


Introduction
The retinal vascular system is a crucial component of the visual system, playing a key role in maintaining intraocular homeostasis and ensuring visual function. By regulating blood flow and adjusting vascular tension, it sustains the normal functionality of visual tissues and ensures an ample supply of blood and oxygen. Additionally, the retinal vascular system is involved in the regulation of intraocular pressure, which is essential for maintaining the morphology and structure of the eyeball [1,2].
Under normal circumstances, the retinal vascular network exhibits a highly organized distribution, including major vessels, such as the central artery and central vein, along with various branches and capillaries. This structured arrangement ensures a stable blood supply to the retina, establishing optimal conditions for the transmission and processing of visual signals. Nevertheless, in pathological conditions, alterations to this structure may occur, leading to the formation of vascular abnormalities. These abnormalities can result in issues such as insufficient nutrient supply and hypoxia, ultimately affecting the normal functioning of the visual system [2,3].
In the field of ophthalmology, the structure and function of retinal vessels play a crucial role in diagnosing and treating various ocular and systemic diseases. The diameter, reflectivity, curvature, and branching characteristics of retinal blood vessels are crucial indicators for various retinal and systemic diseases [4][5][6]. A quantitative analysis of retinal vessels can assist ophthalmologists in detecting and diagnosing the early stages of certain severe conditions [7,8].
In recent years, methods utilizing artificial intelligence (AI) technology for medical image segmentation have garnered widespread attention [9]. In medical image analysis, AI has become a potent tool, assisting doctors in diagnosing diseases quickly and accurately. Retinal vessel segmentation is a crucial task in this field: the manual segmentation of retinal vessels is laborious, time-consuming, and prone to inter-observer variability. With the aid of AI, particularly machine learning and deep learning algorithms, automated retinal vessel segmentation can be achieved, aiding ophthalmologists in better understanding and analyzing the morphology and structure of retinal vessels and thereby providing superior clinical decision support.
The advancements in medical image segmentation have primarily been driven by deep learning techniques. The well-known CNN architecture U-Net [10] has demonstrated excellent performance in medical image segmentation. However, as network depth increases, multiple pooling operations may lead to the loss of crucial information, and the insufficient handling of local contextual features by skip connections can affect the accurate segmentation of retinal vessels. To address these problems, we propose a multi-scale attention fusion network (MsAF-Net) for retinal vessel segmentation. Our main contributions are as follows: (1) We propose a multi-scale feature extraction (MsFE) block to capture diverse scale information from low-level features, providing the network with richer contextual information. (2) We propose a multi-scale attention fusion (MsAF) block, which combines channel attention from low-level features and spatial attention from high-level features, enabling the network to comprehensively understand the content of the image. (3) Combining the MsFE block and the MsAF block, we propose a novel model for retinal vessel segmentation. Experimental results on three datasets demonstrated that our proposed model is strongly competitive with other state-of-the-art methods.

Multi-scale Feature Extraction
Multi-scale feature extraction is a common technique used in various computer vision tasks. A popular method for multi-scale feature extraction involves using filters with different sizes or receptive fields for image convolution. One notable example is the Inception v1, v2, and v3 modules proposed by Szegedy et al. [11,12].
For the task of retinal vessel segmentation, Yang et al. [13] proposed a segmentation method based on U-Net that utilizes the inception module to replace the convolution operation in the encoder. Compared with traditional convolution operations, the inception module can extract features at multiple scales. Experimental results on two datasets demonstrated the superior performance and competitiveness of this method. Shi et al. [14] proposed a novel segmentation method named MD-Net. MD-Net adopts a strategy of dense connections and multi-scale feature extraction, enabling the network to simultaneously focus on both local details and global features of the image, thereby better capturing the structure and morphology of retinal vessels. Additionally, this method enhances network performance by effectively utilizing residual learning and a multi-scale receptive field design. Experimental validation on multiple datasets showed high segmentation accuracy and performance metrics, confirming the superiority and effectiveness of MD-Net in retinal vessel segmentation tasks.

Attention Mechanism
Vaswani et al. [15] introduced the Transformer architecture, allowing the model to attend to different parts of a sequence without being constrained by the sequence length, thus enhancing the model's ability to handle long-range dependencies. Subsequently, attention mechanisms have found widespread application in computer vision tasks [16][17][18][19][20][21][22][23][24].
Researchers have proposed many methods for retinal vessel segmentation that combine attention mechanisms with the U-Net backbone. Dong et al. [25] proposed a cascaded U-Net framework to progressively extract features at different hierarchical levels from images. They introduced a residual attention block, incorporating an attention mechanism to enhance the network's focus on crucial image regions, thereby improving the precision of vessel segmentation. Experimental results on the DRIVE and CHASE_DB1 datasets demonstrated its outstanding segmentation performance. Guo et al. [26] introduced SA-UNet for retinal vessel segmentation. SA-UNet replaces the original convolutional blocks in the U-Net framework with blocks incorporating DropBlock [27] and batch normalization. Additionally, a spatial attention module is integrated between the encoder and decoder. The segmentation performance of SA-UNet is excellent on the DRIVE and CHASE_DB1 datasets.
However, despite the good performance exhibited by U-Net and its variants in the field of medical image segmentation, increasing the network depth may lead to issues such as the loss of crucial information due to multiple pooling operations and the inadequate handling of local contextual features caused by skip connections. These issues can affect the accurate segmentation of retinal vessels.

MsFE Block
Multi-scale features have the capability to capture semantic information at different scales, providing richer contextual information. Therefore, inspired by [11,28], we employed the MsFE block to extract multi-scale information. As illustrated in Figure 2, the MsFE block consists of four parallel branches and a residual connection. The four parallel branches comprise 1×1, 3×3, and 5×5 convolutional operations, as well as a 3×3 max-pooling operation. After concatenating the outputs of the four parallel branches, a 1×1 convolution and a sigmoid activation are applied.
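The branch structure described above can be sketched in NumPy. This is a minimal single-sample illustration, not the paper's implementation: the weights are random, batch handling is omitted, and the placement of the residual connection after the sigmoid is our assumption, since the text does not state exactly where it is added.

```python
import numpy as np

def conv2d_same(x, w):
    """Stride-1 'same' convolution. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, H, W = x.shape
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o])
    return out

def max_pool_same(x, k=3):
    """Stride-1 'same' max pooling over k x k windows."""
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    C, H, W = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def msfe_block(x, rng):
    """Four parallel branches (1x1, 3x3, 5x5 conv, 3x3 max-pool), concatenation,
    1x1 conv + sigmoid, plus a residual connection (placement assumed)."""
    C, H, W = x.shape
    branches = [conv2d_same(x, rng.standard_normal((C, C, k, k)) * 0.1)
                for k in (1, 3, 5)]
    branches.append(max_pool_same(x))
    cat = np.concatenate(branches, axis=0)               # (4C, H, W)
    w1 = rng.standard_normal((C, 4 * C, 1, 1)) * 0.1
    y = 1.0 / (1.0 + np.exp(-conv2d_same(cat, w1)))      # 1x1 conv + sigmoid
    return y + x                                         # residual connection
```

Because every branch uses 'same' padding, all four outputs keep the input's spatial size, which is what makes the channel-wise concatenation valid.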

MsAF Block
The channel attention mechanism focuses on adjusting the weights of different channels in the network's feature maps to enhance useful features and suppress those that are less relevant to the current task. The spatial attention mechanism, on the other hand, is concerned with how the network prioritizes different spatial positions in the image, allowing for a selective emphasis on crucial areas. Inspired by [16,29], we simultaneously introduced both channel attention and spatial attention mechanisms, designing a novel MsAF block. As illustrated in Figure 3, we applied channel attention to the low-level features and spatial attention to the high-level features. The final step merges the features extracted by each attention mechanism, enabling the network to comprehensively understand the content of the image. We describe the detailed operation below. First, we defined the low-level features as X_L and the high-level features after the upsample operation as X_H. For the low-level features X_L, global average pooling (GAP) was applied to compress the features. Subsequently, two fully connected (FC) layers and an activation function were utilized to obtain the channel-wise dependencies Y'_L. This can be expressed as

Y'_L = θ(f_2(δ(f_1(GAP(X_L))))),

where f_1 and f_2 denote the fully connected layers, θ denotes the sigmoid function, and δ denotes the ReLU function. Then, the channel attention map can be represented as

Y_L = X_L ⊗ Y'_L,

where Y_L ∈ R^(H×W×C) and ⊗ denotes element-wise multiplication broadcast over the spatial dimensions.
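The channel-attention path above (GAP, two FC layers with ReLU and sigmoid, then channel-wise rescaling) can be sketched in NumPy. The weight shapes, including the channel-reduction ratio implied by `w1`/`w2`, are assumptions, and the function name `channel_attention` is ours.

```python
import numpy as np

def channel_attention(x_low, w1, w2):
    """Channel attention on low-level features X_L.
    x_low: (C, H, W); w1: (C//r, C) and w2: (C, C//r) are the FC weights f1, f2."""
    s = x_low.mean(axis=(1, 2))                  # global average pooling -> (C,)
    z = np.maximum(w1 @ s, 0.0)                  # f1 followed by ReLU (delta)
    a = 1.0 / (1.0 + np.exp(-(w2 @ z)))          # f2 followed by sigmoid -> Y'_L
    return x_low * a[:, None, None]              # Y_L = X_L (x) Y'_L, shape (C, H, W)

# Hypothetical usage with random weights and reduction ratio r = 2:
rng = np.random.default_rng(0)
C, r = 8, 2
x = rng.standard_normal((C, 6, 6))
y = channel_attention(x, rng.standard_normal((C // r, C)),
                      rng.standard_normal((C, C // r)))
```

Since the sigmoid output lies in (0, 1), each channel of the input is only ever attenuated, never amplified, which is the "suppress less relevant channels" behavior the text describes.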
For the high-level features X_H, we initially performed two pooling operations, concatenated the resulting two feature maps, and then utilized a 7×7 convolution to generate the spatial attention map Y'_H. This can be expressed as

Y'_H = θ(f^(7×7)([AvgPool(X_H); MaxPool(X_H)])),

where θ denotes the sigmoid function and f^(7×7) denotes the 7×7 convolution. Then, the final spatial attention map Y_H can be represented as

Y_H = X_H ⊗ Y'_H,

where Y_H ∈ R^(H×W×C). We concatenated the obtained channel attention map Y_L and spatial attention map Y_H, performed a 3×3 convolution operation, and generated the multi-scale attention fusion feature map Y. This can be represented as follows:

Y = f^(3×3)([Y_L; Y_H]),

where Y ∈ R^(H×W×C), and f^(3×3) represents the 3×3 convolution operation.
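The spatial-attention path and the final fusion can be sketched the same way. This is a simplified, single-sample NumPy illustration with random weights; the two pooling operations are taken to be channel-wise average and max pooling (CBAM-style attention, which the description matches), and the function names are ours.

```python
import numpy as np

def conv2d_same(x, w):
    """Stride-1 'same' convolution. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, H, W = x.shape
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o])
    return out

def spatial_attention(x_high, w7):
    """Spatial attention on high-level features X_H. w7: (1, 2, 7, 7)."""
    avg = x_high.mean(axis=0, keepdims=True)     # channel-wise average pooling
    mx = x_high.max(axis=0, keepdims=True)       # channel-wise max pooling
    cat = np.concatenate([avg, mx], axis=0)      # (2, H, W)
    a = 1.0 / (1.0 + np.exp(-conv2d_same(cat, w7)))   # 7x7 conv + sigmoid -> Y'_H
    return x_high * a                            # Y_H = X_H (x) Y'_H

def msaf_fuse(y_low, y_high, w3):
    """Fuse Y_L and Y_H: concatenate, then 3x3 convolution. w3: (C, 2C, 3, 3)."""
    return conv2d_same(np.concatenate([y_low, y_high], axis=0), w3)
```

The 7×7 kernel gives the spatial map a large receptive field over the pooled statistics, while the closing 3×3 convolution reduces the concatenated 2C channels back to C so the fused map can replace the skip connection.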

Dataset
The experiments were conducted using the DRIVE dataset [30], the CHASE_DB1 dataset [31], and the STARE dataset [32]. Table 1 displays the specific information of each dataset. Because the three datasets are small, which may lead to overfitting, we utilized horizontal flips, vertical flips, rotations, the addition of Gaussian noise, brightness adjustment, and other methods to augment the data. After augmentation, the training samples for the DRIVE and CHASE_DB1 datasets reached 800 images each, and the STARE dataset had 400 training samples.
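The augmentation pipeline described above can be sketched as follows: geometric transforms (flips, rotations) must be applied to the image and its vessel label together, while photometric transforms (noise, brightness) leave the label untouched. The noise level and brightness range are illustrative values, not the ones used in the experiments.

```python
import numpy as np

def augment_pairs(img, mask, rng):
    """Return the original image/mask pair plus five augmented copies.
    img: (H, W, C) floats in [0, 1]; mask: (H, W) binary vessel labels."""
    pairs = [(img, mask)]
    # Geometric transforms: image and mask change together.
    pairs.append((np.flip(img, axis=1).copy(), np.flip(mask, axis=1).copy()))  # horizontal flip
    pairs.append((np.flip(img, axis=0).copy(), np.flip(mask, axis=0).copy()))  # vertical flip
    pairs.append((np.rot90(img, 1, axes=(0, 1)).copy(), np.rot90(mask, 1).copy()))  # 90° rotation
    # Photometric transforms: labels stay unchanged.
    noisy = np.clip(img + rng.normal(0.0, 0.05, img.shape), 0.0, 1.0)   # Gaussian noise
    pairs.append((noisy, mask))
    bright = np.clip(img * rng.uniform(0.8, 1.2), 0.0, 1.0)             # brightness jitter
    pairs.append((bright, mask))
    return pairs
```

Keeping the mask synchronized with every geometric transform is the critical detail: a flipped image paired with an unflipped label would teach the network exactly the wrong vessel positions.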

Experimental Setup
All experiments employed the Adam optimizer, training was stopped after 200 epochs, and the input images were uniformly resized to 448 × 448 × 3. The Dice loss was used as the loss function, with an initial learning rate of 0.001. Evaluation metrics comprised the sensitivity (SE), F1-score (F1), specificity (SP), accuracy (ACC), and the area under the receiver operating characteristic curve (AUC).
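The loss function and the threshold-based evaluation metrics listed above can be computed as follows. This is a plain NumPy sketch; the soft Dice formulation and the smoothing constant `eps` are common choices rather than the paper's exact implementation, and AUC is omitted since it requires ranking predictions over all thresholds.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss between a probability map and a binary ground truth."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def binary_metrics(pred_bin, target):
    """SE, SP, ACC, and F1 from a binarized prediction and a binary ground truth."""
    tp = np.sum((pred_bin == 1) & (target == 1))
    tn = np.sum((pred_bin == 0) & (target == 0))
    fp = np.sum((pred_bin == 1) & (target == 0))
    fn = np.sum((pred_bin == 0) & (target == 1))
    se = tp / (tp + fn)                    # sensitivity (recall on vessel pixels)
    sp = tn / (tn + fp)                    # specificity (recall on background)
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    prec = tp / (tp + fp)
    f1 = 2 * prec * se / (prec + se)       # F1-score
    return se, sp, acc, f1
```

Because vessel pixels are a small minority of each image, accuracy alone is dominated by the background class; Dice, sensitivity, and F1 are the quantities that actually reflect vessel-level quality, which is why they are emphasized in the comparisons below.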
Table 2 displays the sensitivity, F1-score, specificity, accuracy, and AUC values of the proposed model and the baseline models on the DRIVE dataset. It can be observed that, although the specificity value of Unet++ was higher than that of the proposed model, the proposed model achieved the highest values in terms of sensitivity, F1-score, accuracy, and AUC. The comparison of the performance between the proposed model and state-of-the-art methods on the DRIVE dataset is shown in Table 5. Compared with the other advanced models, the proposed model achieved the highest sensitivity value of 0.8611 and the highest F1-score value of 0.8383 on the DRIVE dataset, surpassing the highest values obtained by the other advanced models by 0.14% and 0.26%, respectively. The differences in specificity, accuracy, and AUC compared with the highest values obtained by the other advanced models were 0.19%, 0.72%, and 1.15%, respectively. The comparison of the performance between the proposed model and state-of-the-art methods on the CHASE_DB1 dataset is shown in Table 6. Compared with the other advanced models, the proposed model achieved the highest sensitivity value of 0.8630 and the highest F1-score value of 0.8435 on the CHASE_DB1 dataset, surpassing the highest values obtained by the other advanced models by 0.57% and 0.85%, respectively. The differences in specificity, accuracy, and AUC compared with the highest values obtained by the other advanced models were 0.46%, 0.34%, and 0.94%, respectively. The comparison of the performance between the proposed model and state-of-the-art methods on the STARE dataset is shown in Table 7. Compared with the other advanced models, the proposed model achieved the highest sensitivity and F1-score values on the STARE dataset, surpassing the highest values obtained by the other advanced models by 0.48% and 0.20%, respectively. The differences in specificity, accuracy, and AUC compared with the highest values obtained by the other advanced models were 0.91%, 0.05%, and 0.81%, respectively. To observe the segmentation results more intuitively, this work introduced a qualitative analysis for performance visualization. A sample image was selected from each respective test set, and the corresponding segmentation results are presented in Figures 5-7. As demonstrated in these figures, the proposed model exhibited satisfactory segmentation performance, showcasing its ability to effectively detect vessels in retinal images.
Examining the red box in Figure 5, it is evident that the proposed model accurately identified the vessels in the retinal image, whereas the baseline model mistakenly identified a branching vessel at the same location, resulting in false positives. In the green box of Figure 5, it is apparent that all models failed to recognize a small vessel, leading to false negatives.
In Figure 6, the proposed model accurately identified vessels in the retinal image. In the red box of Figure 6, the baseline model erroneously identified a branching vessel, resulting in false positives. In the green box of Figure 6, Attention U-Net accurately recognized this vessel, while U-Net and Unet++ identified only a portion, resulting in false negatives. Examining the green box in Figure 7, it can be observed that the proposed model identified this vessel more completely. Unet++ recognized only a small portion, while U-Net and Attention U-Net almost entirely failed to identify this vessel, resulting in false negatives. In the red box of the ground truth in Figure 7, representing a completely background area, all models recognized an incomplete vessel, leading to false positives.
Figures 8-10 display the differential images of the segmentation results on each dataset. Through the analysis of the differential images, we can distinctly observe the distribution of false positives and false negatives across different models and datasets. Upon comparing the differential images of the different models, as depicted in Figures 8-10, it is evident that the proposed model exhibited fewer false positives and false negatives than the baseline models across all three datasets. Furthermore, contrasting the differential images of the different datasets revealed significant variations in the quantities of false positives and false negatives. In the segmentation of the DRIVE and CHASE_DB1 datasets, the missegmentation of major vessels led to a higher occurrence of false positives and false negatives. Conversely, in the segmentation of the STARE dataset, the major vessels were accurately segmented, with only some minor vessels left unsegmented, resulting in fewer false positives and false negatives. We attributed these differences to variations in image quality and annotation accuracy between the datasets.

Discussion
Retinal vessel segmentation poses significant challenges due to the complex morphology of retinal vessels, which includes branching, bending, and irregular shapes, and due to the small proportion of vessel pixels, which leads to severe class imbalance. Additionally, retinal fundus images are often affected by noise, resulting in low image contrast. These factors together make the task of retinal vessel segmentation highly challenging.
In retinal vessel segmentation tasks, false positives and false negatives can lead to different consequences. First, false positives may result in incorrect diagnoses or unnecessary treatments. If non-vessel regions are erroneously labeled as blood vessels, clinicians may mistakenly believe that abnormalities exist and proceed with unnecessary further examinations or therapies. On the other hand, false negatives may lead to overlooking lesions or abnormalities in the retina, such as vessel occlusion or abnormal vessel morphology. If crucial vessel regions are erroneously excluded, clinicians may miss important clues for diagnosing diseases, leading to delayed treatment or inadequate therapy. Therefore, reducing the occurrences of false positives and false negatives is crucial in retinal vessel segmentation tasks.
The proposed MsFE block extracts low-level features from different scales, and the MsAF block integrates channel attention from low-level features and spatial attention from high-level features, enabling the model to comprehend the image content more comprehensively. As observed in Figures 5-7, the proposed model accurately identified small vessels in the images, demonstrating fewer false negatives compared with the baseline networks. Meanwhile, the analysis of the differential images in Figures 8-10 demonstrated that the false positives and false negatives of the proposed model were lower than those of the baseline models across all three datasets. All of this indicates that the fusion of features from different scales contributed to the improvement of accuracy in retinal vessel segmentation.
After a comprehensive analysis of the data in Tables 5-7, it is evident that, in comparison with other state-of-the-art models, the proposed model achieved the highest values in sensitivity and F1-score on the three datasets. Although the proposed model did not attain the highest values in specificity, accuracy, and AUC, the differences compared with the other advanced models were minimal. Sensitivity measures a model's ability to identify true positive samples, while the F1-score provides a balanced evaluation of the model's classification performance on both positive and negative samples, which is particularly useful in scenarios with class imbalance. In the context of retinal vessel segmentation tasks, correctly identifying vessel pixels in the image is crucial, and the small proportion of vessel pixels in retinal images leads to a severe class imbalance issue. Therefore, given that the proposed model attained the highest values in sensitivity and F1-score on the DRIVE, CHASE_DB1, and STARE datasets, it demonstrated excellent segmentation performance.
Overall, the segmentation performance of the proposed model was satisfactory. However, there were still some segmentation errors due to the unique morphology of the vessels and factors such as low image contrast. Additionally, the limited size of the datasets may pose challenges to the generalization capability of the model.

Conclusions
In this paper, we proposed a novel model for retinal vessel segmentation built upon the U-Net backbone architecture. For each layer of the U-Net backbone, two additional blocks, namely, the MsFE block and the MsAF block, were incorporated between the encoder and decoder. The MsFE block extracts low-level features from different scales, while the MsAF block performs attention fusion across various scales. Experimental evaluations conducted on the DRIVE, CHASE_DB1, and STARE datasets demonstrated that MsAF-UNet exhibited competitive performance.
In future work, our focus will be on investigating multi-scale attention fusion mechanisms to further enhance the segmentation performance of retinal vessels.

Figure 1 illustrates the framework of the proposed model. The model is implemented based on the U-Net architecture and incorporates two blocks: the MsFE block and the MsAF block. These blocks are inserted between the encoder and decoder at each layer of the U-Net backbone. The MsFE block extracts low-level features from different scales, while the MsAF block performs feature fusion across different scales. Ultimately, the output of the MsAF block replaces the skip connection in the original U-Net.

Figure 1. The framework of the proposed model.

Figure 2. Illustration of the MsFE block. The MsFE block extracts features at different scales by using convolutional kernels of different sizes.

Figure 3. Illustration of the MsAF block. The MsAF block fuses channel attention from low-level features and spatial attention from high-level features.

Tables 3 and 4 display the sensitivity, F1-score, specificity, accuracy, and AUC values of the proposed model and the baseline models on the CHASE_DB1 dataset and the STARE dataset, respectively. It can be observed that, compared with the baseline models, all evaluation metrics of the proposed model achieved the highest values on the CHASE_DB1 and STARE datasets. The ROC curves for the proposed model and the baseline models on the three datasets are depicted in Figure 4. It is evident that the proposed model achieved higher AUC values on all three datasets compared with the baseline models, indicating its superior segmentation performance over the baselines.

Figure 4. The ROC curves of the proposed model and baseline models. (a) The ROC curves on the DRIVE dataset. (b) The ROC curves on the CHASE_DB1 dataset. (c) The ROC curves on the STARE dataset.

Figure 5. The segmentation result on the DRIVE dataset. The red boxes in subfigures (c-f) represent false positives, while the green boxes represent false negatives.

Figure 6. The segmentation result on the CHASE_DB1 dataset. The red boxes in subfigures (c-f) represent false positives, while the green boxes represent false negatives.

Figure 7. The segmentation result on the STARE dataset. The red boxes in subfigures (c-f) represent false positives, while the green boxes represent false negatives.

Figure 8. The differential images from the DRIVE dataset.

Figure 9. The differential images from the CHASE_DB1 dataset.

Figure 10. The differential images from the STARE dataset.

Table 1. The specific information of each dataset.

Table 2. The segmentation results on the DRIVE dataset.

Table 3. The segmentation results on the CHASE_DB1 dataset.

Table 4. The segmentation results on the STARE dataset.

Table 5. Comparison results on the DRIVE dataset.

Table 6. Comparison results on the CHASE_DB1 dataset.

Table 7. Comparison results on the STARE dataset.