1. Introduction
According to the World Report on Vision (https://www.who.int/publications/i/item/world-report-on-vision, accessed on 5 September 2021) released by the World Health Organization in October 2019, more than 418 million people worldwide have glaucoma, diabetic retinopathy (DR), age-related macular degeneration (AMD), or other eye diseases that can cause blindness. Accurate segmentation of retinal images is an important prerequisite for doctors to diagnose and predict related diseases. In current clinical practice, however, this morphological information is usually obtained by manual visual inspection, which is laborious, time-consuming, and subjective [1]. Automatic segmentation algorithms can help doctors analyze complex fundus images, and their steadily improving accuracy has attracted growing attention in recent years. Tremendous efforts have been devoted to retinal vessel segmentation, and existing methods can be broadly divided into four main categories: window-based processing methods, classification-based methods, tracking-based methods, and deep-learning-based methods.
Window-based processing methods: Chaudhuri et al. [2] used two-dimensional matched filters for retinal vessel segmentation. Building on this, a segmentation method combining multiscale matched filtering and dual thresholding was proposed [3]. In [4], a segmentation algorithm based on the Gabor filter was proposed, and in [5], a Cake filter was introduced that better detects elongated structures in images. Window-based methods preserve the original structure of blood vessels and segment thin vessels well, but because they must process every pixel, they suffer from high computational cost and long runtimes.
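As a concrete illustration of the window-based idea, the sketch below builds a Chaudhuri-style matched filter: a zero-mean, inverted-Gaussian cross-profile (vessels are darker than the background), rotated over a bank of orientations, with the maximum response kept per pixel. This is a simplified plain-Python sketch; the kernel size, sigma, segment length, and angular step are illustrative choices, not the values used in [2]:

```python
import math

def matched_filter_kernel(size=9, sigma=2.0, length=7.0, angle_deg=0.0):
    # Inverted-Gaussian profile across the vessel, truncated to `length`
    # along the vessel axis, rotated by angle_deg.
    th = math.radians(angle_deg)
    half = size // 2
    kernel = [[0.0] * size for _ in range(size)]
    support = []
    for y in range(-half, half + 1):
        for x in range(-half, half + 1):
            u = x * math.cos(th) + y * math.sin(th)    # across the vessel
            v = -x * math.sin(th) + y * math.cos(th)   # along the vessel
            if abs(v) <= length / 2:
                kernel[y + half][x + half] = -math.exp(-u * u / (2 * sigma ** 2))
                support.append((y + half, x + half))
    # Zero-mean normalization: a flat background region then gives no response.
    mean = sum(kernel[r][c] for r, c in support) / len(support)
    for r, c in support:
        kernel[r][c] -= mean
    return kernel

def max_response(image, angles=range(0, 180, 15), **kw):
    # Correlate with a bank of rotated kernels; keep the maximum response.
    kernels = [matched_filter_kernel(angle_deg=a, **kw) for a in angles]
    h, w = len(image), len(image[0])
    half = len(kernels[0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(half, h - half):
        for j in range(half, w - half):
            out[i][j] = max(
                sum(k[a][b] * image[i + a - half][j + b - half]
                    for a in range(len(k)) for b in range(len(k)))
                for k in kernels)
    return out
```

On a bright background with a dark vertical stripe, the response peaks on the stripe and is (up to floating-point error) zero on the flat background, which is exactly the behavior the dual-thresholding step in [3] exploits.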
Classification-based methods: These methods first train a classifier on feature vectors extracted for each pixel and then classify the regions obtained from low-level processing as vessel or background. Common classifiers include the KNN classifier proposed by Salem et al. [6] and the AdaBoost classifier proposed by Carmen et al. [7]. Classification-based methods generally require manual feature extraction and manual classifier selection, so there is still considerable room to improve their efficiency.
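A minimal sketch of the classification idea, assuming two hand-crafted features per pixel (here, a normalized intensity and a local-contrast value); the features and the choice of k are illustrative, not those of [6]:

```python
import math
from collections import Counter

def knn_classify(train_feats, train_labels, feat, k=3):
    # Label a pixel by majority vote among the k training pixels whose
    # feature vectors are nearest in Euclidean distance.
    nearest = sorted(range(len(train_feats)),
                     key=lambda i: math.dist(train_feats[i], feat))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training pixels: vessels are dark with high local contrast,
# background is bright with low contrast.
train = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85),
         (0.8, 0.1), (0.9, 0.05), (0.85, 0.1)]
labels = ["vessel"] * 3 + ["background"] * 3
```

The limitation noted above is visible even here: both the two features and k were chosen by hand, and a different choice can change every prediction.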
Tracking-based methods: These methods first determine an initial seed point and then iteratively follow vessel characteristics, such as width and direction, from that point. Semi-automatic vessel tracking algorithms start from an initial point and direction and iteratively search for vessels using a breadth-first search. Liu et al. [8] proposed a tracking algorithm with manually determined initial points, which achieved the final segmentation by repeatedly finding new starting points in the remaining vessels for resegmentation. In [9], some of the brightest vessel pixels were found and used as starting points, and in [10], a particle filter was used for retinal vessel tracking. Since semi-automatic tracking algorithms rely on the determination of the starting point, they have gradually been replaced by fully automatic tracking algorithms [11,12]. Fully automatic vessel tracking is highly adaptive but still depends heavily on the selection of initial seed points and directions.
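The iterative growth described above can be sketched as a breadth-first search over a vesselness map; the single threshold test below is a stand-in for the width and direction criteria a real tracker would apply at each step:

```python
from collections import deque

def track_vessel(prob_map, seed, thresh=0.5):
    # Grow from a seed pixel into any 8-connected neighbour whose
    # vesselness exceeds `thresh`; returns the set of tracked pixels.
    h, w = len(prob_map), len(prob_map[0])
    visited = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < h and 0 <= nc < w
                        and (nr, nc) not in visited
                        and prob_map[nr][nc] >= thresh):
                    visited.add((nr, nc))
                    queue.append((nr, nc))
    return visited
```

The sketch also shows the failure mode noted above: any vessel segment not connected to the chosen seed is simply never reached, which is why seed selection matters so much.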
Deep-learning-based methods: These approaches generally first train a model on vessel and background data and then use it to classify each pixel in retinal images. A typical example based on convolutional neural networks (CNNs) is the vessel segmentation method proposed by Khalaf et al. [13], who used a CNN with three convolution layers. Fu et al. [14] proposed a fully convolutional network called DeepVessel, which uses side output layers to help the network learn multiscale features. In addition, encoder-decoder structures, especially U-Net [15], are widely used in fundus image segmentation owing to their excellent feature extraction capability. Building on the encoder-decoder structure, Wu et al. [16] proposed the multiscale VesselNet, and Feng et al. [17] proposed a cross-connected convolutional neural network (CcNet) for vessel segmentation, which also adopts a multiscale approach. To further improve segmentation performance, attention mechanisms have gradually been applied to retinal vessel segmentation; for example, Zhang et al. [18] proposed an attention-guided network (AG-Net).
Although these efforts have steadily improved the accuracy of retinal vessel segmentation, and U-Net-style architectures in particular have become increasingly popular, two problems remain. First, the skip-connection between the encoder and decoder in U-Net is too simple, so noise is transmitted to the decoder along with useful features. Second, to restore the original image size for pixel-level prediction, upsampling in the decoder is usually realized by bilinear interpolation or deconvolution, and the oversimple bilinear interpolation does not take the correlation between neighboring pixels into account. These problems lead to broken microvessels in the segmentation results, reduced accuracy, and sensitivity to noise and lesions.
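The data-independence of bilinear interpolation can be seen directly: every interpolated pixel is a fixed average of its neighbors regardless of image content, so a one-pixel-wide vessel is inevitably blurred. A minimal sketch (align-corners-style 2x upsampling):

```python
def bilinear_upsample(x):
    # Align-corners-style 2x upsampling: original pixels are kept on the
    # even grid, and every inserted pixel is a FIXED average of its
    # neighbours, independent of image content.
    h, w = len(x), len(x[0])
    H, W = 2 * h - 1, 2 * w - 1
    out = [[0.0] * W for _ in range(H)]
    for i in range(h):
        for j in range(w):
            out[2 * i][2 * j] = x[i][j]
    for i in range(H):
        for j in range(W):
            if i % 2 == 1 and j % 2 == 0:        # between two rows
                out[i][j] = (out[i - 1][j] + out[i + 1][j]) / 2
            elif i % 2 == 0 and j % 2 == 1:      # between two columns
                out[i][j] = (out[i][j - 1] + out[i][j + 1]) / 2
            elif i % 2 == 1 and j % 2 == 1:      # centre of four pixels
                out[i][j] = (out[i - 1][j - 1] + out[i - 1][j + 1]
                             + out[i + 1][j - 1] + out[i + 1][j + 1]) / 4
    return out
```

Upsampling a feature map containing a one-pixel-wide vertical vessel produces 0.5-valued halo pixels on both sides of it, smearing exactly the thin structures a vessel segmenter must preserve.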
To tackle the above problems, a gated skip-connection network with adaptive upsampling (GSAU-Net) is proposed to segment retinal vessels. The main contributions of this paper are as follows:
We propose a gated skip-connection network with adaptive upsampling (GSAU-Net) to segment retinal vessels. A gated skip-connection is introduced between the encoder and decoder to facilitate the flow of information from the encoder to the decoder; it effectively removes noise and helps the decoder focus on processing detailed information;
A simple yet effective adaptive upsampling module is used to recover feature maps in the decoder, replacing the data-independent bilinear interpolation used extensively in previous methods. Compared with deconvolution, it improves the performance of the model at almost no additional computational cost;
Finally, comprehensive experiments were performed to evaluate the proposed method on three public datasets (DRIVE [19], CHASE [20], and STARE [21]), showing its effectiveness.
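The two modules are defined in detail in Section 2. Purely as a generic illustration of the two ideas (scalar placeholder weights stand in for learned convolutions, and sub-pixel channel rearrangement stands in for the adaptive upsampling; neither is the paper's exact formulation), a gate and a content-dependent upsampling can be sketched as:

```python
import math

def gated_skip(enc, dec, w_e=1.0, w_d=1.0, b=0.0):
    # Illustrative gate: g = sigmoid(w_e*enc + w_d*dec + b); the encoder
    # feature is scaled by g before being passed to the decoder, so
    # positions the decoder marks as noise are suppressed. The scalar
    # weights are placeholders for learned 1x1 convolutions.
    h, w = len(enc), len(enc[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            g = 1.0 / (1.0 + math.exp(-(w_e * enc[i][j] + w_d * dec[i][j] + b)))
            out[i][j] = g * enc[i][j]
    return out

def pixel_shuffle2x(chans):
    # Sub-pixel upsampling: four h x w feature channels are interleaved
    # into one 2h x 2w map. Because those channels come from a learned
    # convolution, the upsampling adapts to content, unlike bilinear
    # interpolation's fixed weights.
    c00, c01, c10, c11 = chans
    h, w = len(c00), len(c00[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            out[2 * i][2 * j] = c00[i][j]
            out[2 * i][2 * j + 1] = c01[i][j]
            out[2 * i + 1][2 * j] = c10[i][j]
            out[2 * i + 1][2 * j + 1] = c11[i][j]
    return out
```

In the gate, a strongly negative decoder response drives g toward 0 and zeroes out the corresponding encoder feature, while a strongly positive one lets it pass almost unchanged.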
The remainder of this article is organized as follows. The second section describes the proposed method in detail, including the backbone network, the gated skip-connection, and the adaptive upsampling module. The third section introduces the datasets, experimental settings, and evaluation metrics. In the fourth section, our experimental results are discussed and compared. Finally, the conclusion is drawn in the fifth section.