Bi-SANet—Bilateral Network with Scale Attention for Retinal Vessel Segmentation

Abstract: The segmentation of retinal vessels is critical for the diagnosis of some fundus diseases. Retinal vessel segmentation requires abundant spatial information and receptive fields of different sizes, while existing methods usually sacrifice spatial resolution to achieve real-time inference speed, resulting in inadequate segmentation of vessels in low-contrast regions and weak resistance to noise. The asymmetry of capillaries in fundus images also increases the difficulty of segmentation. In this paper, we propose a two-branch network based on multi-scale attention to alleviate these problems. First, a coarse network with a multi-scale U-Net as the backbone is designed to capture more semantic information and to generate high-resolution features, and a multi-scale attention module is used to obtain sufficient receptive fields. The other branch is a fine network, which uses the residual block of a small convolution kernel to make up for the deficiency of spatial information. Finally, a feature fusion module aggregates the information of the coarse and fine networks. Experiments were performed on the DRIVE, CHASE, and STARE data sets, on which the accuracy reached 96.93%, 97.58%, and 97.70%; the specificity reached 97.72%, 98.52%, and 98.94%; and the F-measure reached 83.82%, 81.39%, and 84.36%, respectively. The experimental results show that, compared with some state-of-the-art methods such as Sine-Net and SA-Net, our proposed method performs better on all three data sets.


Introduction
At present, cataract, glaucoma, and diabetic retinopathy are the main diseases leading to blindness [1]. More than 418 million people worldwide suffer from these diseases, according to a report (https://www.who.int/publications/i/item/world-report-on-vision accessed on 13 September 2021) published by the World Health Organization. Patients with eye diseases often do not notice the aggravation of asymptomatic conditions, so early screening and treatment of eye diseases are necessary [2]. The asymmetry of the fundus capillaries adds complexity to the physician's diagnosis of the disease. However, extensive screening and diagnosis are time-consuming and laborious; therefore, automatic segmentation methods are particularly helpful for doctors in diagnosing and predicting eye diseases. In recent years, many scholars have studied automatic retinal vessel segmentation algorithms, which are mainly divided into two categories: unsupervised methods and supervised methods [1].
Unsupervised vessel segmentation algorithms include the matched filter method [3], multi-threshold vessel detection [4], morphology-based vessel segmentation [5], the region growth method, the B-COSFIRE filter method [6], the multi-scale layer decomposition and locally adaptive threshold vessel segmentation method [7], the finite-element-based binary level set method [8], and fuzzy clustering [9]. In [10], the multi-scale 2D Gabor wavelet transform and morphological reconstruction were used to segment the fundus vessels. In [11], a combination of the level set and shape preference approaches was presented.
Compared with unsupervised methods, supervised methods use manually labeled data to train a classifier to segment the fundus vascular image. A typical supervised method is the deep convolutional neural network (CNN) [12][13][14]. However, because a CNN with fully connected layers cannot make structured predictions, the fully convolutional network (FCN) was introduced, providing an end-to-end vessel segmentation scheme, and it was rapidly applied to vessel segmentation. In [15], an FCN with a side output layer, called DeepVessel, was proposed. The work in [16,17] proposed an FCN with down-sampling and up-sampling layers to address the class imbalance between blood vessels and background.
In addition, encoder-decoder structures are widely used in fundus image segmentation due to their excellent feature extraction capability, especially U-Net [18]. In [19], a multi-label architecture based on U-Net was proposed, in which a side output layer was used to capture multi-scale features.
On the basis of encoders and decoders, the dual-branch network is another method for segmenting fundus images. One branch of the dual-branch network is used as a coarse segmentation network, and the other, a fine segmentation network, is used as an aid to the coarse segmentation network. The work in [20] proposed a multi-scale two-branch network (MS-NFN) to improve the performance of capillary segmentation, in which both branches had similar U-Net structures. The work in [21] proposed a coarse-to-fine segmentation network (CTF-Net) for fundus vascular segmentation. Moreover, the multi-scale network is another important direction of fundus image segmentation. In [22], VesselNet, based on a multi-scale method, was proposed. In [23], a cross-connected convolutional neural network (CcNet) for vessel segmentation was proposed, which also adopted a multi-scale method. In order to improve the segmentation ability of the network, attention mechanisms have gradually been applied to retinal vessel segmentation. In [24], an attention guiding network (AG-Net) was proposed.
From the CNN and FCN to the U-shaped network architecture, the basic networks for vessel segmentation have developed considerably, and networks with the U-Net structure in particular have become more and more popular. However, the existing methods still face the following challenges: (1) The low contrast and extraneous noise of fundus images (see Figure 1) make it difficult to segment blood vessels, especially in low-contrast regions and lesion regions. (2) The general structure of a two-branch network uses two U-Net networks as the backbone, which increases the computational cost and prolongs the segmentation time of the model; this is not friendly to clinical diagnosis and medical treatment. (3) There is still room for improvement in the accuracy of vessel segmentation; specifically, it is necessary to further improve the sensitivity while maintaining specificity and accuracy. (4) Abundant spatial information and receptive fields of different sizes cannot be satisfied at the same time.
To address the above problems, this paper proposes a retinal vessel segmentation model based on a bilateral network with scale attention (Bi-SANet). The main contributions of this paper are as follows:
• We propose a bilateral network containing coarse and fine branches, which are responsible for extracting semantic information and spatial information, respectively. Meanwhile, multi-scale input is applied to the network to improve the feature extraction ability of the model for images of different scales.
• In order to improve the network's ability to extract vessel semantic information in low-contrast regions, a multi-scale attention module is introduced at the end of the down-sampling path of the coarse network. This improves the pertinence of information recovery during up-sampling.
• The U-shaped fine network is replaced by a module that mainly uses convolution layers with different dilation rates to make up for the spatial information lost by the coarse network. This not only improves the segmentation ability of the network but also reduces its computational complexity. Finally, the feature fusion module aggregates different levels of information from the coarse and fine networks.

Figure 1. Example of a fundus image, with the optic disc region, macular area, high-contrast region, and low-contrast region annotated.

The remainder of this article is organized as follows. Section 2 describes the proposed method in detail, including the network structure, the multi-scale attention module, the fine network structure, and the feature fusion module. Section 3 introduces the datasets, experimental settings, and evaluation indexes. In Section 4, our experimental results are discussed and compared. Finally, the conclusion is drawn in Section 5.

The Network Structure of Bi-SANet
Compared with the encoder-decoder structure, the dual-branch network has different characteristics. The encoder-decoder structure uses the encoder to extract features, while the up-sampling operations in the decoder restore the original resolution, and skip connections link the feature maps of the encoder to the decoder.
In this paper, a two-branch network is used to segment fundus images: one branch is a coarse network (CoarseNet), and the other is a fine network (FineNet). The network structure is shown in Figure 2. The coarse network is responsible for extracting the semantic information of fundus images, and the fine network makes up for the spatial information lost by the coarse network. As the backbone of the network, an improved U-Net is used in the coarse network to extract feature information. However, some spatial information cannot be captured in the process of down-sampling. In order to make up for this lack of spatial information, we use a fine network to further extract fine semantic information. Finally, the feature fusion module is used to aggregate the information of the CoarseNet and the FineNet.
In the CoarseNet, in order to extract the semantic information of feature maps with different scales, we introduced a multi-scale attention module at the end of the encoder. In the FineNet, because the spatial detail information needs to be saved, only one downsampling operation is used as the encoder module. The spatial information module is used to extract spatial detail information.

Coarse Network (CoarseNet)
As shown in Figure 2, the basic network of the coarse network is a multi-scale U-Net. The multi-scale input is divided into four branches, each of which takes a single-channel input image; the input sizes of the four branches are 48 × 48, 24 × 24, 12 × 12, and 6 × 6 pixels.
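The multi-scale input described above can be sketched as follows. This is a minimal illustration, assuming the lower-resolution branch inputs are produced by average-pooling the 48 × 48 patch; the text does not state the exact downsampling operator.

```python
import torch
import torch.nn.functional as F

def multi_scale_inputs(patch: torch.Tensor) -> list[torch.Tensor]:
    """Build the four branch inputs (48x48, 24x24, 12x12, 6x6) from one patch.

    patch: tensor of shape (N, 1, 48, 48), single-channel as in the paper.
    """
    scales = [patch]
    for _ in range(3):
        # Halve the resolution for each successive branch (assumed operator).
        scales.append(F.avg_pool2d(scales[-1], kernel_size=2))
    return scales

x = torch.randn(2, 1, 48, 48)
sizes = [tuple(s.shape[-2:]) for s in multi_scale_inputs(x)]
print(sizes)  # [(48, 48), (24, 24), (12, 12), (6, 6)]
```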
The coarse network is mainly composed of two parts: the first is the encoder (feature extraction) part, and the second is the decoder (up-sampling) part (gray part of Figure 2). The proposed multi-scale attention module is embedded in the encoder. The use of attention mechanisms allows the network to focus on information related to blood vessels, suppresses noise in the background, alleviates the information loss caused by down-sampling, and enhances context semantic information. The details of the multi-scale attention module are as follows.

Multi-Scale Attention Module
Multi-scale U-Net obtains feature mappings of different scales in the process of downsampling. In order to deal with objects of different scales better, a multi-scale attention module is used to combine these feature maps. This can alleviate the loss of semantic information during down-sampling.
The feature maps with different scales have different correlations with the input of the network, so we add an attention mechanism to calculate the correlation weight for each feature map with different scales. In this way, the network can pay more attention to the feature map with a higher correlation with the input image. This module is used at the end of the encoder.
Our proposed multi-scale attention module is shown in Figure 3. First, we use bilinear interpolation to up-sample the feature maps F_s of different scales obtained by the encoder to the original image size. In order to reduce the computational cost, we use 1 × 1 convolutions to compress these feature maps, and the compressed results from the different scales are concatenated along the channel dimension to form a four-channel mixed feature map F. Global average pooling and global maximum pooling are applied to F to obtain the correlation weight coefficient β of each channel. Element-wise multiplication then distributes the obtained correlation weights over the four channels to obtain the attention mixed feature map F'. In order to alleviate the problems of gradient vanishing and gradient explosion, we add a skip connection: y_MA is obtained from the mixed feature map and the attention mixed feature map by element-wise addition. The multi-scale attention module is defined as follows:

β = δ(P_avg(F) + P_max(F)),  F' = β ⊗ F,  y_MA = F ⊕ F',   (1)

where δ represents the sigmoid activation function, and P_avg and P_max represent global average pooling and global maximum pooling, respectively.
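A minimal PyTorch sketch of the MA module as described above. The exact channel counts of the 1 × 1 compressions are not fully specified in the text, so this sketch assumes each scale is compressed to one channel, giving the four-channel mixed map F; the per-scale channel counts in the usage example are likewise illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttention(nn.Module):
    def __init__(self, in_channels: list[int], out_size: int = 48):
        super().__init__()
        self.out_size = out_size
        # One 1x1 convolution per scale; each compresses to a single channel
        # (assumption), so concatenation yields a four-channel map F.
        self.compress = nn.ModuleList(
            [nn.Conv2d(c, 1, kernel_size=1) for c in in_channels]
        )

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # Bilinearly up-sample every scale to the patch size, then compress.
        maps = [
            conv(F.interpolate(f, size=(self.out_size, self.out_size),
                               mode="bilinear", align_corners=False))
            for conv, f in zip(self.compress, feats)
        ]
        mixed = torch.cat(maps, dim=1)                       # F (4 channels)
        # Per-channel correlation weights beta from avg + max pooling.
        beta = torch.sigmoid(
            F.adaptive_avg_pool2d(mixed, 1) + F.adaptive_max_pool2d(mixed, 1)
        )
        attended = mixed * beta                              # F'
        return mixed + attended                              # y_MA (skip connection)

feats = [torch.randn(2, c, s, s)
         for c, s in [(64, 48), (128, 24), (256, 12), (512, 6)]]
y = MultiScaleAttention([64, 128, 256, 512])(feats)
print(y.shape)  # torch.Size([2, 4, 48, 48])
```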

Fine Network (FineNet)
In order to capture more accurate boundary information, the FineNet down-samples the feature map only once, with a 3 × 3 convolution kernel, and the result of the down-sampling is input to the spatial information module for processing. There is no decoder module, because the spatial details need to be preserved in the FineNet. In short, the FineNet first reduces the size of the feature map with a single down-sampling convolution and then applies a spatial information module to capture spatial details.
Details of the spatial information module are shown below.

Spatial Detail Module
The advantage of dilated convolution over traditional convolution is its ability to achieve a larger receptive field without increasing the number of parameters, while maintaining the same feature resolution, so the model can better understand global context information. The thickness of blood vessels in retinal images varies. In order to segment vessels of different thicknesses more accurately, we use dilated convolutions with different dilation rates in the spatial detail module to capture multi-scale feature information, which improves the segmentation accuracy of vessel edges and tiny vessels. The structure of the spatial detail module is shown in Figure 4. In the spatial detail module, the number of channels of the input feature map is first halved using a 1 × 1 convolution to obtain the feature map G1. Halving the number of channels reduces the number of parameters and computations, thus increasing the efficiency of the model in segmenting the fundus vessels. Then, three parallel convolution layers with different dilation rates are used to capture the spatial feature information of G1 and the multi-scale context. The feature maps output by the three convolution layers are concatenated by channel, and a skip connection preserves the feature information at the original scale. Finally, the feature map is up-sampled to a size that matches the output of the CoarseNet.
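The dilated-branch design can be sketched in PyTorch as follows. This is a sketch under stated assumptions: the input channel count, the 1 × 1 projection after concatenation, and the additive form of the skip connection are illustrative choices not pinned down by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialDetailModule(nn.Module):
    def __init__(self, channels: int, rates=(1, 3, 5)):
        super().__init__()
        half = channels // 2
        self.reduce = nn.Conv2d(channels, half, kernel_size=1)  # G1: halve channels
        # Three parallel 3x3 convolutions with different dilation rates;
        # padding = rate keeps the spatial resolution unchanged.
        self.branches = nn.ModuleList(
            [nn.Conv2d(half, half, kernel_size=3, padding=r, dilation=r)
             for r in rates]
        )
        # Project the concatenated branches back so the skip connection
        # (assumed here to be element-wise addition) lines up.
        self.project = nn.Conv2d(half * len(rates), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g1 = self.reduce(x)
        multi = torch.cat([b(g1) for b in self.branches], dim=1)
        out = x + self.project(multi)                  # skip connection
        # Up-sample to match the CoarseNet output resolution.
        return F.interpolate(out, scale_factor=2, mode="bilinear",
                             align_corners=False)

x = torch.randn(2, 64, 24, 24)   # feature map after FineNet's single down-sample
y = SpatialDetailModule(64)(x)
print(y.shape)  # torch.Size([2, 64, 48, 48])
```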

Feature Fusion Module
As the FineNet down-samples only once, its feature map retains the detailed information of the fundus vessels, which we call low-level semantic information.
The feature map obtained by the coarse network contains rich context semantic information, but the detailed information of the retinal image, especially the boundary information of small vessels, is seriously lost after multiple down-sampling operations; the information obtained from the CoarseNet is called high-level semantic information. Low-level features are rich in detail but lack semantic information. Therefore, we propose a feature fusion module to integrate high-level and low-level features, to enrich the spatial information of the coarse network, and to suppress background noise in the low-level features. The aggregation of high-level and low-level information is represented by Equation (2):

F_c = F_L ⊕ F_H,   (2)

where F_L represents low-level features, F_H represents high-level features, and ⊕ is element-wise addition. After the above operation, in order to make the network pay more attention to the target region, we use global average pooling to process the fused features. In short, the final refined output is computed as Equation (3):

y = F_c ⊗ δ(P_avg(F_c)),  with  P_avg(F_c) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} X_c(i, j),   (3)

where X_c(i, j) is the pixel value of the feature map F_c at position (i, j) on channel c, δ is the sigmoid activation function, and ⊗ is element-wise multiplication.
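Under this reading, the fusion module could be sketched as follows. This is hypothetical: the exact form of the pooling branch (here a 1 × 1 convolution followed by a sigmoid on the pooled vector) is our assumption, not a detail confirmed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Sketch: fuse by element-wise addition (Equation (2)), then reweight the
    fused map with a channel vector from global average pooling (assumed form)."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        fused = low + high                                       # F_c = F_L + F_H
        w = torch.sigmoid(self.fc(F.adaptive_avg_pool2d(fused, 1)))
        return fused * w                                         # refined output

low = torch.randn(2, 64, 48, 48)    # FineNet (low-level) features
high = torch.randn(2, 64, 48, 48)   # CoarseNet (high-level) features
out = FeatureFusion(64)(low, high)
print(out.shape)  # torch.Size([2, 64, 48, 48])
```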

Datasets
The method in this paper is validated on three public datasets: DRIVE [25], CHASE [26], and STARE [27]. The DRIVE data set contains 40 retinal images with corresponding ground-truth images and mask images; the size of each image is 565 × 584 pixels. The training and test sets each contain half of the images in the DRIVE data set.
The CHASE data set consists of 28 retinal images with corresponding ground-truth images and mask images, each of which is 1280 × 960 pixels. For the CHASE data set, we adopted the partition method proposed by Zhuang et al. [28]: the first twenty images were used for training, and the remaining eight images were used for testing and evaluation.
The STARE data set contains 20 retinal images with corresponding ground-truth images and mask images. Each image is 700 × 605 pixels. We used the leave-one-out method to generate the training and test sets, so each image is tested once. We averaged the experimental results of the 20 images to obtain the evaluation results for the whole STARE data set.
In addition, we expanded the training data by applying the random patch method [29] to the images, which is very important for improving segmentation accuracy, preventing over-fitting, and improving the robustness of the network.
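Random patch extraction of this kind can be sketched as follows; the coordinates and counts here are illustrative, not the exact sampling scheme of [29].

```python
import numpy as np

def random_patches(image: np.ndarray, patch_size: int, n: int,
                   rng: np.random.Generator) -> np.ndarray:
    """Extract n random patch_size x patch_size patches from a 2D image."""
    h, w = image.shape
    ys = rng.integers(0, h - patch_size + 1, size=n)
    xs = rng.integers(0, w - patch_size + 1, size=n)
    return np.stack([image[y:y + patch_size, x:x + patch_size]
                     for y, x in zip(ys, xs)])

# A DRIVE-sized dummy image (584 x 565); the paper extracts 48x48 patches.
img = np.arange(584 * 565, dtype=np.float32).reshape(584, 565)
patches = random_patches(img, 48, 16, np.random.default_rng(0))
print(patches.shape)  # (16, 48, 48)
```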

Experimental Environment and Parameter Settings
Our network structure implementation and training method followed that of Jiang et al. [30]. The model was implemented with the open-source package PyTorch on an Ubuntu 64-bit server with a Quadro RTX 6000. We trained the model with the random patch method in [29]. The patch size was set to 48 × 48 pixels. During training, the number of epochs was set to 100, the batch size was set to 256, and the number of patches extracted from each image was 10,480. The initial learning rate was set to 0.001 and updated using the step decay method; the decay coefficient and the weight decay coefficient were set to 0.01 and 0.0005, respectively. The optimizer was Adam, with the momentum parameter set to 10^−8.
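A minimal PyTorch sketch of this training configuration. Two assumptions: the stated 10^−8 "momentum" is mapped to Adam's eps term (Adam has no classic momentum hyperparameter of that magnitude), and the step interval of the decay schedule, which the text does not give, is set to a placeholder of 50 epochs.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 2, kernel_size=3, padding=1)  # stand-in for Bi-SANet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             eps=1e-8,            # assumed reading of "momentum 10^-8"
                             weight_decay=5e-4)   # weight decay coefficient 0.0005
# Step decay with the stated decay coefficient 0.01; step_size=50 is a placeholder.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.01)

for epoch in range(100):
    # one training epoch over the 48x48 patches would run here
    scheduler.step()
print(optimizer.param_groups[0]["lr"])
```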
The loss function is the cross-entropy loss, defined as follows:

L = −(1/N) Σ_{i=1}^{N} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ],

where y_i denotes the real label and ŷ_i denotes the predicted label.
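This per-pixel binary cross-entropy corresponds to PyTorch's nn.BCELoss; a small check of the library call against the manual formula (the probabilities and labels below are made-up values):

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()  # mean binary cross-entropy, as defined above
pred = torch.tensor([0.9, 0.2, 0.7])    # predicted vessel probabilities
target = torch.tensor([1.0, 0.0, 1.0])  # ground-truth labels
loss = criterion(pred, target)

# Manual computation of the same formula for comparison:
manual = -(target * torch.log(pred)
           + (1 - target) * torch.log(1 - pred)).mean()
assert torch.isclose(loss, manual)
print(float(loss))
```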

Performance Evaluation Indicator
In this paper, Sensitivity (Sen), Specificity (Spe), Accuracy (Acc), and F-measure were calculated from the confusion matrix to analyze the performance of retinal image segmentation. The evaluation indexes are defined as follows:

Sen = TP / (TP + FN),  Spe = TN / (TN + FP),  Acc = (TP + TN) / (TP + TN + FP + FN),  F-measure = (2 × Pre × Sen) / (Pre + Sen),  with Pre = TP / (TP + FP),

where TP is the number of correctly segmented vessel pixels, TN is the number of correctly segmented background pixels, FP is the number of background pixels incorrectly classified as vessel pixels, and FN is the number of vessel pixels incorrectly marked as background.
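These indexes can be computed directly from the confusion-matrix counts; the counts in the example are made-up values for illustration.

```python
def segmentation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Standard confusion-matrix metrics used in the evaluation above."""
    sen = tp / (tp + fn)                    # Sensitivity (recall)
    spe = tn / (tn + fp)                    # Specificity
    acc = (tp + tn) / (tp + tn + fp + fn)   # Accuracy
    pre = tp / (tp + fp)                    # Precision
    f1 = 2 * pre * sen / (pre + sen)        # F-measure
    return {"Sen": sen, "Spe": spe, "Acc": acc, "F1": f1}

m = segmentation_metrics(tp=80, tn=900, fp=10, fn=10)
print(m)  # Sen ~= 0.889, Spe ~= 0.989, Acc = 0.98, F1 ~= 0.889
```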

Discussion of Model Performance at Different Dilation Rates
Convolution combinations with different dilation rates have different receptive fields. In order to verify the influence of different receptive fields, we compared the experimental results of the model under different dilation rates on the DRIVE data set. From the experimental results in Table 1, it can be seen that the model has the strongest segmentation ability when the dilation rates in the spatial detail module are (1, 3, 5). The bolded data indicate the optimal value for each indicator.
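The receptive field of a single dilated convolution is dilation × (kernel − 1) + 1, so the three parallel 3 × 3 branches with rates (1, 3, 5) see 3-, 7-, and 11-pixel windows:

```python
def dilated_rf(kernel: int, dilation: int) -> int:
    """Receptive field of a single dilated convolution layer."""
    return dilation * (kernel - 1) + 1

rates = (1, 3, 5)
rfs = [dilated_rf(3, r) for r in rates]
print(rfs)  # [3, 7, 11]
```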

Structure Ablation
In order to verify the effectiveness of the multi-scale attention module (MA), we compared the experimental results before and after adding this module to the model. In Tables 2 and 3, MU-Net represents our original baseline network, which is a multi-scale input U-Net; MA represents the multi-scale attention module; FineNet represents the fine network; and FFA represents the feature fusion module. MU-Net and MA make up the CoarseNet. From Tables 2 and 3, it can be seen that the accuracy of the model on the DRIVE and CHASE data sets improved to some extent after adding the multi-scale attention module (MA). As MA can effectively capture multi-scale information, the model can better segment retinal vessels of different thicknesses. In Tables 2 and 3, we also compared the influence of the FineNet on the accuracy. From the experimental results, we can see that the addition of FineNet improved the accuracy, sensitivity, F-measure, and area under the ROC curve of the model to varying degrees. Moreover, the addition of FineNet does not bring a particularly large number of parameters to the network while improving segmentation performance. The FineNet uses only one down-sampling module, contains no up-sampling, and uses few pooling operations; it therefore retains the edge details of the image and makes up for the information lost to the many convolution and pooling operations in the CoarseNet.

In order to verify the effectiveness of our added modules, in this section, we compare the ROC (Receiver Operating Characteristic) and PR (Precision-Recall) curves of different models on the DRIVE and CHASE data sets. In Figures 5 and 6, MU-Net represents our original baseline network, which is a multi-scale input U-Net; MA represents the multi-scale attention module; FineNet represents the fine network; and FFA represents the feature fusion module. From Figure 5, we find that the ROC and PR values of the model increase with each module added: the model with the highest ROC and PR values is the one containing all three modules. Compared with the baseline network, its ROC value is increased by 0.3% and its PR value by 1.69%. The closer the ROC curve is to the upper left corner of the plot, the more accurate the model.

Attention Module Ablation
The multi-scale attention module (MA) was designed to enhance the segmentation capability of the network. In order to verify the effectiveness of the MA module and to show that it performs better than other attention modules, we designed a set of ablation experiments. The MU-Net+FineNet+FFA configuration from the structural ablation section was used as the baseline, and only the attention module embedded in the CoarseNet was changed, to verify the superiority of the MA module proposed in this paper. We selected three classic, lightweight attention modules to compare with the MA module. The first is the efficient channel attention (ECA) module of ECA-Net [31], which is often used in object detection and instance segmentation tasks; it was empirically shown that avoiding dimensionality reduction and appropriate cross-channel interaction are important for learning effective channel attention. The second is the squeeze-and-excitation (SE) block in SENet [32], which improves classification performance by modeling the inter-dependencies between feature channels. The last is the expectation-maximization attention (EMA) block in EMANet [33] for semantic segmentation, which iterates the expectation-maximization (EM) algorithm to produce a compact set of bases on which to run the attention mechanism, thus greatly reducing the complexity. Tables 4 and 5 show the experimental results of adding the MA module, the ECA module, the SE block, and the EMA module to the baseline network, respectively. Although the ECA, SE, and EMA attention modules improved the performance of the model to a certain extent, in terms of the F-measure, the overall segmentation results of the MA module were higher than those of the three attention modules embedded in the baseline network. Compared with the other three methods, the MA module has the best performance.
This is because MA can automatically obtain the importance of each feature map by learning and can then promote the useful features and suppress the features that are not useful for the current task according to this importance. Meanwhile, MA brings minimal parameters to the network, which is friendly for clinical diagnosis.

Model Parameter Quantity and Computation Time Analysis
The total number of parameters of this model is 46.17 MB. Our model took 6 s, 9.61 s, and 10.62 s to segment a complete retinal vessel image on the DRIVE, CHASE, and STARE data sets, respectively. The U-Net model in [34] took 4 s to segment a complete image of the DRIVE data set, with an F-measure of 0.8142; the F-measure of our model on the DRIVE data set is 0.8382. Compared with U-Net, our model takes 2 s longer to segment an image from the DRIVE data set, but it outperforms U-Net in all indexes; in particular, its sensitivity is 13.53% higher.

Visual Comparison with Different Methods
We compared our method with U-Net [18] and WA-Net [35]. Figures 7 and 8 present visualizations on the DRIVE and STARE data sets, respectively. In Figure 8, column (c) shows the result of segmentation using the slice method in [23]. The retinal vessels segmented by U-Net [18] contain more noise, and background is mistakenly segmented as blood vessels; the small vessels in low-contrast regions are not clearly segmented and appear broken. Although the retinal vessels segmented by WA-Net [35] contain less noise, the problem of unclear small vessels remains. CcNet [23] is susceptible to the lesion area, resulting in a lot of noise in the segmentation results, and the segmented vessels are discontinuous.
The method proposed in this paper uses CoarseNet to extract rich context semantic information. The FineNet makes up for the spatial information, enables the network to distinguish the foreground and background regions well, and reduces wrong segmentations.
The segmentation results of our method contain less noise, especially in the segmentation of small vessels in low-contrast regions, as shown in the red boxed regions of Figures 7 and 8. From these figures, it can be seen that the retinal vessel images segmented by our method contain less noise, and the segmentation of small vessels is more comprehensive and clearer, with better robustness and accuracy.

Comparison of Segmentation Results with Different Methods
In order to further verify the effectiveness of the proposed algorithm for retinal vessel segmentation, the proposed method is compared with some unsupervised and supervised methods in terms of accuracy, sensitivity, specificity, and F-measure on the DRIVE, CHASE, and STARE data sets. From the experimental results in Tables 6-8, supervised methods generally perform better than unsupervised methods on retinal vessel segmentation. For the DRIVE data set, the F-measure of our method reaches 83.82%, which is 0.93% higher than that in [36], and the sensitivity is 2.51% higher than that in [37]. From Tables 6-8, it can be seen that the specificity, accuracy, and F-measure of our method are the highest in the tables across the different data sets. Although the sensitivity of [37] on the CHASE data set is higher than that of our method, its segmentation of small vessels is not very good, and fractures sometimes occur; moreover, our method has the highest F-measure, its specificity remains relatively stable, and the noise contained in the segmented images is relatively small. On the STARE data set, our method is 0.2% and 0.63% higher than [34] in sensitivity and F-measure, respectively. Therefore, from the evaluation results in Tables 6-8, it can be seen that our method is superior to the other supervised vessel segmentation methods in separating retinal vessels from the background and extracting different features. Table 9 shows the per-image test results on the STARE data set using the leave-one-out method.

Conclusions
The segmentation of retinal vessels is a key step in the diagnosis of ophthalmic diseases. In this paper, we proposed a two-branch model with a scale attention mechanism, which can automatically segment blood vessels in fundus images. The coarse network of this model takes a multi-scale U-Net as the backbone to capture more semantic information and to generate high-resolution features; at the same time, a multi-scale attention module is used to obtain sufficient receptive fields. The other branch is a fine network, which uses the residual block of a small convolution kernel to make up for the spatial information lost by the coarse network. Finally, the feature fusion module is used to aggregate the information of the coarse and fine branches. We validated this method on the DRIVE, STARE, and CHASE data sets, and the experimental results showed that our method performs better in retinal vessel segmentation than some recent algorithms, such as WA-Net [35] and Sine-Net [42].
Several experimental results showed that our model achieves good results on these three data sets, which indicates that it has practical application potential in screening and diagnosis systems for ophthalmic diseases. The visualization results show that our method performs well on small vessels in low-contrast areas. The imbalance between the numbers of foreground and background pixels in fundus images is also a problem that hinders vessel segmentation; in the future, we will alleviate this problem by designing an auxiliary loss function.

Institutional Review Board Statement: Ethical review and approval are not applicable for this paper.
Informed Consent Statement: An informed consent statement is not applicable.

Data Availability Statement:
We used three public datasets to evaluate the proposed segmentation network, namely DRIVE [25], CHASE [26], and STARE [27].

Conflicts of Interest:
The authors declare no conflict of interest.