Aiding the Diagnosis of Diabetic and Hypertensive Retinopathy Using Artificial Intelligence-Based Semantic Segmentation

Automatic segmentation of retinal images is an important task in computer-assisted medical image analysis for the diagnosis of diseases such as hypertension, diabetic and hypertensive retinopathy, and arteriosclerosis. Among these diseases, diabetic retinopathy, which is the leading cause of vision loss, can be diagnosed early through the detection of retinal vessels. The manual detection of these retinal vessels is a time-consuming process that can be automated with the help of deep-learning-based artificial intelligence. The detection of vessels is difficult due to intensity variation and noise from non-ideal imaging. Although there are deep learning approaches for vessel segmentation, these methods require many trainable parameters, which increase the network complexity. To address these issues, this paper presents a dual-residual-stream-based vessel segmentation network (Vess-Net), which is not as deep as conventional semantic segmentation networks, but provides good segmentation with few trainable parameters and layers. The method takes advantage of artificial intelligence for semantic segmentation to aid the diagnosis of retinopathy. To evaluate the proposed Vess-Net method, experiments were conducted with three publicly available datasets for vessel segmentation: digital retinal images for vessel extraction (DRIVE), the Child Heart Health Study in England (CHASE-DB1), and structured analysis of retina (STARE). Experimental results show that Vess-Net achieved superior performance for all datasets, with sensitivity (Se), specificity (Sp), area under the curve (AUC), and accuracy (Acc) of 80.22%, 98.1%, 98.2%, and 96.55% for DRIVE; 82.06%, 98.41%, 98.0%, and 97.26% for CHASE-DB1; and 85.26%, 97.91%, 98.83%, and 96.97% for the STARE dataset.


Introduction
The segmentation of blood vessels from retinal images is a difficult and time-consuming task for medical specialists when diagnosing diseases such as hypertension, diabetic retinopathy, and arteriosclerosis [1]. Retinal vasculature is considered a unique entity for examining the structural and pathological changes related to ophthalmic diseases (glaucoma, diabetic retinopathy, hypertension, age-related macular degeneration, etc.). The measurement and analysis of retinal vessels can be used as a biomarker for the diagnosis of cardiac patients [2], similar to how homocysteine is used as a biomarker for diabetic retinopathy (DR) [3], which is the leading cause of vision loss [4]. Correct retinal vessel segmentation provides the opportunity for early diagnosis of diabetic retinopathy, which can later lead to blindness, and also helps to localize the position of the optic disc and fovea [5]. Considering the

Related Works
Vessel segmentation methods can be divided into two main groups: techniques based on conventional handcrafted local features using typical image-processing schemes, and techniques that use machine-learning or deep-learning features.

Vessel Segmentation Based on Conventional Handcrafted Local Features
These methods use conventional image-processing schemes to identify vessels in fundus images. The usual schemes are color-based segmentation, adaptive thresholding, morphological schemes, and other local handcrafted feature-based methods that use image enhancement prior to the segmentation. Akram et al. used a 2D Gabor filter for retinal image enhancement and the multi-layer thresholding approach to detect blood vessels [23]. Fraz et al. used quantitative analysis of retinal vessel topology and size (QUARTZ), where vessel segmentation is carried out using a line detection scheme in combination with hysteresis morphological reconstruction based on a bi-threshold procedure [24]. Kar et al. used automatic blood vessel extraction using a matched filtering-based integrated system, which uses a curvelet transform and fuzzy c-means algorithm to separate vessels from the background [25]. Another recent example of an unsupervised approach was illustrated by Zhao et al., who used a framework with three steps. In the first step, a non-local total variation model adapted to the Retinex theory is used. In the second step, the image is divided into super-pixels to locate the object of interest. Finally, the segmentation task is performed using an infinite active contour model [26]. Pandey et al. used two separate approaches to segment thin and thick blood vessels. To segment thin blood vessels, local phase-preserving de-noising is used in combination with line detection, local normalization, and entropy thresholding. Thick vessels are segmented by maximum entropy thresholding [27]. Neto et al. proposed an unsupervised coarse-to-fine method for the blood vessel segmentation. Image enhancement schemes, such as Gaussian smoothing, morphological top-hat filtering, and contrast enhancement, are first used to increase the contrast and reduce the noise, and then the segmentation task is carried out via adaptive local thresholding [28].

Sundaram et al. proposed a hybrid segmentation approach that uses techniques such as morphology, multi-scale vessel enhancement, and image fusion; that is, area-based morphology and thresholding are used to highlight blood vessels [29]. Zhao et al. proposed an infinite active contour model to automatically segment retinal blood vessels, where hybrid region information of the image is used for small vasculature structure [30]. To detect vessels rapidly and accurately, Jiang et al. proposed a global thresholding-based morphological method, where capillaries are detected using centerline detection [31]. Rodrigues et al. performed vessel segmentation based on the wavelet transform and mathematical morphology, where tubular properties of blood vessels were used to detect retinal veins and arteries [32]. Sazak et al. proposed a retinal blood vessel image-enhancement method in order to increase the segmentation accuracy. They used the multi-scale bowler-hat transform based on mathematical morphology, where vessel-like structures are detected by thresholding after combining different structuring elements [33]. Chalakkal et al. proposed a retinal vessel segmentation method using the curvelet transform in combination with line operators to enhance the contrast between the background and blood vessels; they used multiple steps of conventional image processing such as adaptive histogram equalization, diffusion filtering, and color space transformations [34]. Wahid et al. used multiple levels to enhance retinal images for segmentation. In their technique, the enhanced image is subtracted from the input image iteratively, resultant images are fused to create one image, and this image is then enhanced using contrast-limited adaptive histogram equalization (CLAHE) and fuzzy histogram-based equalization (FHBE). Finally, thresholding is used to segment the enhanced image [35]. Ahamed et al. also applied CLAHE with the green channel of fundus images and used a multiscale line detection approach in combination with hysteresis thresholding; the results in this technique are refined by morphology [36].

Vessel Segmentation Using Machine Learning or Deep Learning (CNN)
Methods based on handcrafted local features have limited performance. In addition, the performance is affected by the type of database. Therefore, machine learning or deep learning-based methods have been researched as an alternative. Zhang et al. used a supervised learning method for vessel segmentation. They used the anisotropic wavelet transform, where a 2D image is lifted to a 3D image that provides orientation and position information. Then, a random forest classifier is trained to segment retinal vessels from the background [37]. Tan et al. proposed a single neural network to segment optic discs, fovea, and blood vessels from retinal images. The algorithm passes the three channels of input from the point's neighborhood to the seven-layer convolutional neural network (CNN) to classify the candidate class [38]. Zhu et al. proposed a supervised method based on an extreme learning machine (ELM), which utilizes a 39D vector with features such as morphology, divergence field, Hessian features, phase congruency, and discriminative features. These features are then classified by the ELM, which extracts the vasculature from the background [39]. Wang et al. proposed a cascade classification method for retinal vessel segmentation. They iteratively trained a Mahalanobis distance classifier with a one-pass feed-forward process to classify the vessels and background [40]. Tuba et al. proposed support vector machine (SVM)-based classification using chromaticity and coefficients of the discrete cosine transform as features. The green channel from retinal images was used as the base of these features as it has maximum vessel information [41]. Savelli et al. presented a novel approach to segment vessels that corrected the illumination. Dehazing was used as a pre-processing technique to avoid haze and shadow noise, and classification was performed by a CNN that was trained on 800,000 patches with a dimension of 27 × 27 (the center pixel was considered the decision pixel) [42].
Girard et al. proposed a fast deep learning method to segment vessels using a U-Net-inspired CNN for semantic segmentation, where the encoder and decoder provide the down-sampling and up-sampling of the image, respectively [43]. Hu et al. proposed a method for retinal vessel segmentation based on a CNN and conditional random fields (CRFs). There are two phases in this method: in the first phase, a multiscale CNN architecture with an improved cross-entropy loss function is applied to the image; in the second, CRFs are applied to obtain the refined final result [44]. Fu et al. proposed DeepVessel, a program that uses deep learning in combination with CRFs. A multi-scale and multi-level CNN is used to learn rich hierarchical representations from images [45]. Soomro et al. proposed a deep-learning-based semantic segmentation network inspired by the famous SegNet architecture. In the first step, grayscale data were prepared by principal component analysis (PCA). In the second step, deep-learning-based semantic segmentation was applied to extract the vessels. Finally, post-processing was used to refine the segmentation [46]. Guo et al. proposed a multi-level and multi-scale approach, where short-cut connections were used for the semantic segmentation of vessels and semantic information was passed to forward layers to improve the performance [47]. Chudzik et al. proposed a two-stage method to segment retinal vessels. In the first stage, the CNN is utilized to correlate the image with corresponding ground truth by random tree embedding. In the second stage, a codebook is created by passing the training patches through the CNN in the previous step; this codebook is used to arrange a generative nearest-neighbor search space for the feature vector [48]. Hajabdollahi et al. proposed a simple CNN-based segmentation with fully connected layers. These fully connected layers are quantized and the convolutional layers are pruned to increase the efficiency of the network [49].
Yan et al. proposed a three-stage CNN approach for vessel segmentation to improve the capability of vessel detection. The thick and thin vessels are treated by separate CNNs and the results are fused to produce a single image by a third CNN [50]. Soomro et al. proposed a semantic segmentation network based on modified U-Net, where the pooling layers are replaced by progressive convolution and deeper layers. In addition, dice loss is used as a loss function with stochastic gradient descent (SGD) [51]. Jin et al. proposed a deformable U-Net-based deep neural network. The deformable convolutions are integrated in the network and an up-sampling operator is used to increase the resolution of the image to extract more precise feature information [52]. Leopold et al. presented Pixel CNN with batch normalization (PixelBNN), which is based on U-Net and pixelCNN, where pre-processing is used to resize, reduce the dimension, and enhance the image [53]. Wang et al. used Dense U-Net as a semantic segmentation network for vessel segmentation, where random transformations are used for data augmentation in order to boost the effective patch-based training of the dense network [54]. Feng et al. proposed a cross-connected CNN (CcNet) for retinal vessel segmentation. The CcNet is trained on only the green channel of the fundus image; cross connections and fusion of multi-scale features improve the performance of the network [55]. However, in these previous works, deep networks were used, which included many trainable parameters that increased the network complexity. To address these issues, this paper presents a dual residual stream-based Vess-Net, which is not as deep as conventional semantic segmentation networks but provides good segmentation with few trainable parameters and layers. The method takes advantage of artificial intelligence in the process of semantic segmentation to aid the diagnosis of retinopathy. 
Table 1 shows a comparison between existing methods and Vess-Net for vessel segmentation.

Method | Strength | Weakness
Vessel segmentation using thresholding [23,24,28,29,31,33-36] | Simple method to approximate vessel pixels | False points detected when vessel pixel values are closer to background
Fuzzy-based segmentation [25] | Performs well with uniform pixel values | Intensive pre-processing is required to intensify blood vessels' response
Active contours [26,30] | Better approximation for detection of real boundaries | Iterative and time-consuming processes are required
Vessel tubular properties-based method [32] | Good estimation of vessel-like structures | Limited by pixel discontinuities
Line detection-based method [27] | Removing background helps reduce false skin-like pixels |
Random forest classifier-based method [37] | Lighter method to classify pixels | Various transformations needed before classification to form features
Patch-based CNN [38,42] | Better classification | Training and testing require long processing time
SVM-based method [41] | Lower training time | Use of pre-processing schemes with several images to produce feature vector
Extreme learning machine [39] | Machine learning with many discriminative features | Morphology and other conventional approaches are needed to produce discriminative features
Mahalanobis distance classifier [40] | Simple procedure for training | Pre-processing overhead is still required to compute relevant features
U-Net-based CNN for semantic segmentation [43] | U-Net structure preserves the boundaries well | Gray scale pre-processing is required
Multi-scale CNN [44,47] | Better learning due to multi-receptive fields |

Contribution
This study focuses on retinal vessel segmentation under challenging conditions to aid the diagnosis of retinopathy. Compared to previous works, our research presents the following novelties:
- Vess-Net performs semantic segmentation to detect retinal vessels without the requirement of conventional pre-processing.
- Vess-Net guarantees dual-stream spatial information flow inside and outside the encoder-decoder.
- Vess-Net's internal residual skip path (IRSP) ensures a feature re-use policy in order to compensate for the spatial loss created by the continuous convolution process.
- Vess-Net's outer residual skip path (ORSP) is designed to provide direct spatial edge information from the initial layer of the encoder to the end of the decoder. Moreover, the direct information flow pushes Vess-Net to converge faster (in just 15 epochs with 3075 iterations).
- Vess-Net utilizes the benefits of both identity and non-identity mappings for outer and inner residual connections, respectively.
- For fair comparison with other research results, the trained Vess-Net models and algorithms are made publicly available in [56].

Figure 1 shows an overview of the proposed method, which represents the overall semantic segmentation process in the detection of retinal blood vessels. Vess-Net provides accurate pixel-wise segmentation using a dual-stream spatial information flow provided by the residual mesh. The inner residual mesh compensates for the lost spatial information, whereas the outer residual mesh provides direct edge information from the initial layers to the end of the decoder. Vess-Net takes the original image as the input without pre-processing and provides pixel-wise segmentation in an encoder-decoder manner.

Retinal Blood Vessel Segmentation Using Vess-Net
To extract retinal vessels from the input image, pixel-level classification is required; this is done via continuous convolution until the image is represented by its tiny features. The process of representing images with tiny features involves many convolutions, and in each convolution there is a loss of spatial information. Therefore, to obtain good classification accuracy, the image should not be excessively crushed during convolution [57]. Because deep analysis of retinal vasculature is required for disease diagnosis, the segmentation should be accurate even for tiny vessels. The usual network represents the image using a very small feature map (7 × 7), which is too small for minor information [57]. To detect retinal blood vessels in unconstrained scenarios, the network should maintain the lost spatial information throughout the network; it should be designed so that the feature map is sufficiently detailed to represent maximum features of the blood vessels. The Vess-Net has a minimum feature map size of 27 × 27 for a 447 × 447 input image, which is detailed enough to represent even tiny retinal vessel features. The image classification accuracy with residual networks (ResNet) [58] is higher due to the residual connectivity between the layers, which is beneficial for resolving the vanishing gradient problem [58]. ResNet improves the classification accuracy over visual geometry group networks (VGG nets), which do not use residual skip paths [58,59]. VGG net [59] is the base of SegNet, which also does not use residual skip connections [60]. Vess-Net is a 16-layer semantic segmentation network based on a fully residual encoder-decoder network (FRED-Net) [61]. FRED-Net utilizes the residual connectivity only inside the encoder and decoder, and there are no outer residual connections. 
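The feature-map sizes quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions that are not stated in this passage: every pooling stage uses a 2 × 2 window with stride 2 and no padding, the 3 × 3 convolutions preserve spatial size, and the "usual network" is read as a 224 × 224 classifier with five halving stages. The function name `feature_map_sizes` is hypothetical.

```python
def feature_map_sizes(input_size, num_pools, window=2, stride=2):
    """Spatial size after each unpadded pooling stage (floor division)."""
    sizes = [input_size]
    for _ in range(num_pools):
        sizes.append((sizes[-1] - window) // stride + 1)
    return sizes


# Vess-Net: 447 x 447 input with four pooling layers (Pool1-Pool4).
vess_net_sizes = feature_map_sizes(447, 4)

# Assumed conventional classifier: 224 x 224 input with five halving stages.
conventional_sizes = feature_map_sizes(224, 5)

print(vess_net_sizes)
print(conventional_sizes)
```

Under these assumptions, four pooling stages reduce 447 to 27, while five stages reduce 224 to 7, which matches the two minimum feature-map sizes mentioned in the text.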
The original ResNet [58] only uses the residual connectivity between adjacent layers and there is no decoder in the classification, whereas Vess-Net follows two streams for the information flow with an encoder-decoder design. Stream 1 is a sequential layer-by-layer path in which adjacent layers of the encoder and decoder are connected by non-identity mapping (only), as shown in Figures 2 and 3. Stream 2 is the external stream in which the encoder layers are directly connected with the corresponding decoder layers to provide high-frequency spatial edge information using identity mapping (only), as shown in Figures 2 and 3. Figure 2 shows how the features are empowered by the two streams to detect tiny retinal blood vessels; the connectivity on the left and right is based on non-identity mapping to implement the feature re-use policy inside the encoder and decoder (Stream 1). In addition, the center connectivity represents Stream 2, which connects the encoder directly to the decoder with identity mapping. ResNet [58] uses both identity and non-identity mapping for classification before fully connected layers; Vess-Net uses a similar concept, but in a different way. It utilizes only non-identity mapping inside the encoder and decoder and only identity mapping for outer residual skip connections; in addition, the outer paths (identity) initiate and terminate inside the non-identity block, as shown in Figures 2 and 3. In Figure 2, the connectivity of each convolution block is represented for both the encoder (left) and decoder (right); each convolution block takes the input, $E_i$/$D_j$, from the previous pooling/unpooling layer (Pool$_{i-1}$/Unpool$_{j-1}$ in Figure 2), and provides the empowered output, $Y_i$/$Z_j$, to the rectified linear unit (ReLU-B$_i$/ReLU-B$_j$) of the encoder/decoder.
The first and second convolutional layers of each encoder block are represented by E-Con-A$_i$ and E-Con-B$_i$, respectively, each with a batch normalization (BN) layer, whereas the first and second convolutional layers of each decoder block are represented by D-Con-A$_j$ and D-Con-B$_j$, respectively. $T(E_i)$ and $K(D_j)$ are the output features after the first convolution for the encoder and decoder, respectively, whereas $S(E_i)$ and $S(D_j)$ are the output features after the second convolution for the encoder and decoder, respectively.

According to Figure 3, the first and the last convolution blocks do not have an inner residual connection because these are the input and output convolutional blocks, respectively, as described in [61]. As stated above, every second ReLU in each convolutional block (except the first) in the encoder receives an empowered feature $Y_i$, which is the resultant feature after the element-wise addition of $F(E_i)$ and $S(E_i)$ by the non-identity residual skip path shown in Figure 2; this is given by the following equation:

$$Y_i = S(E_i) + F(E_i) \tag{1}$$

The "+" sign indicates an element-wise addition and $Y_i$ is the enhanced feature available for the activation (ReLU-B$_i$) after the element-wise addition of the feature from E-Con-B$_i$ (after the second convolution, $S(E_i)$) and the feature from the 1 × 1 convolution in the residual skip path ($F(E_i)$). For the decoder, the situation is different because of the outer residual path (identity mapping, shown in the middle of Figure 2); the element-wise addition of this feature inside the decoder-side convolutional block provides quality spatial-edge information.

At the decoder side, each second convolution obtains the edge-information-enriched feature by identity mapping. This enriched feature, $T(D_j)$ (for example, after Point "P" in Figure 3), is available for each second convolution in the decoder block (D-Con-B$_j$) and is given by Equation (2):

$$T(D_j) = K(D_j) + T(E_i) \tag{2}$$

$T(D_j)$ is the enriched feature with the spatial edge information. It is the element-wise addition of $K(D_j)$ and $T(E_i)$, which are, respectively, the output feature after the first activation of each decoder block (ReLU-A$_j$) and the Stream 2 feature that is directly imported after the first activation of each encoder block (ReLU-A$_i$) by identity mapping. $T(D_j)$ is available for each second convolution (except the last) in the decoder, represented by D-Con-B$_j$ in Figure 2. As Vess-Net also involves dual-stream feature empowerment, the enriched feature $T(D_j)$ is further added element-wise to the Stream 1 feature ($F(D_j)$) by non-identity mapping for maximum benefit of the feature re-use policy. According to Figure 2, each second ReLU activation in the decoder (ReLU-B$_j$) receives a dual-stream enhanced feature, $Z_j$, given by the equation:

$$Z_j = S(D_j) + F(D_j) \tag{3}$$

Here, $Z_j$ is a dual-stream enhanced feature obtained by both identity and non-identity mapping (for example, after Point "Q" in Figure 3). In addition, $S(D_j)$ is the output feature from each second convolution (except the last) in the decoder (D-Con-B$_j$) and $F(D_j)$ is the feature from the non-identity residual skip path.

Each outer residual skip path (ORSP-1 to ORSP-4) provides a $T(E_i)$ feature to the corresponding decoder block. $Z_j$ is the empowered feature from the inner and outer streams that improves the network capabilities to segment minor features of retinal blood vessels.
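The three element-wise additions described in this section can be traced with a toy data-flow sketch. This is one reading of the text, not the authors' implementation: features are flat lists of floats, each convolution (with its BN layer) is replaced by a hypothetical stand-in `scale` that multiplies by a single weight, pooling and unpooling are omitted, and all weights are arbitrary illustrative values.

```python
def add(a, b):
    """Element-wise addition of two equally sized feature vectors."""
    return [x + y for x, y in zip(a, b)]


def relu(a):
    return [max(0.0, x) for x in a]


def scale(a, w):
    """Hypothetical stand-in for a convolution + BN: multiply by one weight."""
    return [w * x for x in a]


def encoder_block(E_i, w_a=0.5, w_b=-1.0, w_skip=2.0):
    T_E = relu(scale(E_i, w_a))   # T(E_i): after E-Con-A and ReLU-A, exported on the outer path
    S_E = scale(T_E, w_b)         # S(E_i): after E-Con-B
    F_E = scale(E_i, w_skip)      # F(E_i): 1x1 convolution on the inner (non-identity) skip path
    Y_i = add(S_E, F_E)           # Y_i = S(E_i) + F(E_i)
    return relu(Y_i), T_E         # ReLU-B output, plus the feature sent to the decoder


def decoder_block(D_j, T_E, w_a=1.0, w_b=0.5, w_skip=-1.0):
    K_D = relu(scale(D_j, w_a))   # K(D_j): after D-Con-A and ReLU-A
    T_D = add(K_D, T_E)           # T(D_j) = K(D_j) + T(E_i), identity mapping (Stream 2)
    S_D = scale(T_D, w_b)         # S(D_j): after D-Con-B
    F_D = scale(D_j, w_skip)      # F(D_j): inner (non-identity) skip path of the decoder block
    Z_j = add(S_D, F_D)           # Z_j = S(D_j) + F(D_j)
    return relu(Z_j)


enc_out, outer_feature = encoder_block([1.0, -2.0, 4.0])
dec_out = decoder_block([2.0, 1.0, -1.0], outer_feature)
print(enc_out, outer_feature, dec_out)
```

The point of the sketch is the ordering: the outer (identity) feature is added before the second decoder convolution, and the inner (non-identity) feature is added after it, so the second activation sees both streams.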

Vess-Net Encoder
As shown in Figure 3, Vess-Net has a total of 16 convolution layers of 3 × 3 size; eight convolution layers are for the encoder and eight for the decoder. Inside the Vess-Net encoder, there are four convolutional blocks, each containing only two convolutional layers, as shown in Figure 3. The number of convolution layers determines the depth of the network and the number of trainable parameters. Each convolutional block in the encoder contains one pooling layer (Pool1-Pool4) in order to down-sample the feature. In addition, these pooling layers provide indices and image size information to the decoder to maintain the feature map at the decoder end. Each convolutional layer has a BN layer and a ReLU layer for activation. The non-identity skip connections in the encoder start right after each pooling layer of the current convolutional block and terminate immediately after the second convolution of the next convolutional block in order to avoid the loss of spatial information during the convolution process. The identity skip connections are initiated after the first ReLU of each encoder convolutional block; these paths for identity mapping are directly provided to the corresponding decoder block to provide the spatial edge information. Due to the dual-stream feature empowerment, the Vess-Net encoder provides valuable features to the decoder for up-sampling. Table 2 lists the notable differences between the proposed Vess-Net method, ResNet [58], IrisDenseNet [62], and FRED-Net [61].

Table 2. Key differences between Vess-Net and previous architectures.

Method | Other Architectures | Vess-Net
ResNet [58] | Residual skip path between adjacent layers only | Skip connections between adjacent layers and directly between the encoder and decoder
 | Only uses post-activation because ReLU activation is used after element-wise addition | Uses the pre-activation and post-activation as ReLU is used before and after the element-wise addition

The Vess-Net structure is described in Table 3, which shows that there are 10 residual connections that connect the encoder and decoder, both externally and internally. The inner residual skip connections provide the lost spatial information after the convolution process, whereas the outer path helps the network converge faster by immediately providing spatial information.

Table 3. Vess-Net encoder with inner and outer residual skip connections and activation size of each layer (ECB, ECon, ORSP, IRSP, and Pool indicate encoder convolutional block, encoder convolution layer, outer residual skip path, internal residual skip path, and pooling layer, respectively). "**" refers to layers that include both batch normalization (BN) and ReLU. "*" refers to layers that include only BN, and *Pool/*Unpool shows that the pooling/unpooling layer is activated prior to the ReLU layer. Outer residual skip paths (ORSP-1 to ORSP-4) are initiated from the encoder to provide spatial information to the decoder. Vess-Net uses both pre- and post-activation. The table is based on the digital retinal images for vessel extraction (DRIVE) dataset, which has a size of 447 × 447 × 3.
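The indices and size information that the encoder pooling layers pass to the decoder can be illustrated with a small max-pooling and max-unpooling pair. This is a minimal sketch on nested lists, assuming non-overlapping 2 × 2 windows; the function names are hypothetical and the example input is a 5 × 5 array chosen so that, as with the odd-sized 447 × 447 images, the stored size is needed to restore the original shape.

```python
def max_pool_with_indices(x, size=2):
    """Non-overlapping max pooling that also records where each maximum came from."""
    rows, cols = len(x) // size, len(x[0]) // size
    pooled, indices = [], []
    for r in range(rows):
        pooled_row, index_row = [], []
        for c in range(cols):
            window = [
                (x[r * size + dr][c * size + dc], (r * size + dr, c * size + dc))
                for dr in range(size)
                for dc in range(size)
            ]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices, (len(x), len(x[0]))


def max_unpool(pooled, indices, out_shape):
    """Place each pooled value back at its recorded position; all other cells are zero."""
    out = [[0] * out_shape[1] for _ in range(out_shape[0])]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


image = [
    [1, 3, 2, 0, 5],
    [4, 2, 1, 7, 5],
    [0, 1, 9, 2, 5],
    [6, 0, 3, 4, 5],
    [5, 5, 5, 5, 5],
]
pooled, indices, shape = max_pool_with_indices(image)
restored = max_unpool(pooled, indices, shape)
```

Because the last row and column of an odd-sized input fall outside every window, the pooled map alone cannot tell the decoder whether the original was 4 × 4 or 5 × 5, which is why the size is carried along with the indices.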


Vess-Net Decoder
As shown in Figure 3, the function of the proposed Vess-Net decoder is to produce an output with the same size as the input image. To do so, the decoder layers mirror the encoder layers: each decoder stage reverses the corresponding encoder stage so that the output image has the same size as the input. The Vess-Net decoder receives several paths from the encoder. The four paths from the pooling layers of the encoder provide the indices and size information to the corresponding unpooling layers of the decoder; this helps maintain the feature map of each stage of the decoder according to its corresponding stage in the encoder. Four outer residual skip paths (ORSP-1 to ORSP-4) originate from the first convolutional layer of each encoder block; each ORSP terminates before the second ReLU activation in the decoder. The ORSPs provide important spatial edge information to the corresponding decoder layer. In each decoder block, there are two additional layers for element-wise addition: one for the IRSP via non-identity mapping (Stream 1) and one for the ORSP via identity mapping (Stream 2). Both streams through the IRSP and ORSP are combined in the decoder to ensure reliable features for accurate semantic segmentation of retinal blood vessels in difficult scenarios. The output of the network is in the form of a two-channel mask. As we have two classes (vessel and non-vessel), the number of filters for the last convolutional layer in the decoder is set to 2, which produces two separate masks for the vessel and non-vessel classes. The pixel classification layer and the softmax function at the end of the decoder assign one predicted class to each pixel in the feature map. Table 4 lists the layer patterns of the proposed Vess-Net decoder.

Table 4. Vess-Net decoder with inner and outer residual skip connections and activation size of each layer (DCB, DCon, ORSP, IRSP, and Unpool indicate decoder convolutional block, decoder convolution layer, outer residual skip path, internal residual skip path, and unpooling layer, respectively). "**" refers to layers that include both BN and ReLU. "*" refers to layers with only BN, and *Pool/*Unpool shows that the pooling/unpooling layer is activated prior to the ReLU layer. Outer residual skip paths (ORSP-1 to ORSP-4) are initiated from the encoder to provide spatial information to the decoder. Vess-Net uses both pre- and post-activation. The table is based on the DRIVE dataset, which has a size of 447 × 447 × 3.
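The last step of the decoder, where a softmax and a pixel classification layer turn the two-channel output into one class per pixel, can be sketched as follows. This is an illustrative example rather than the authors' code: the channel order (0 for non-vessel, 1 for vessel), the function names, and the score values are all assumptions.

```python
import math

CLASSES = ("non-vessel", "vessel")  # assumed channel order, for illustration only


def softmax(scores):
    """Numerically stable softmax over a short list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pixels(score_maps):
    """score_maps[k][r][c] is the raw score of class k at pixel (r, c)."""
    rows, cols = len(score_maps[0]), len(score_maps[0][0])
    labels = []
    for r in range(rows):
        label_row = []
        for c in range(cols):
            probs = softmax([channel[r][c] for channel in score_maps])
            label_row.append(probs.index(max(probs)))  # most probable class
        labels.append(label_row)
    return labels


# Hypothetical 2 x 2 output of the last decoder convolution (two filters).
scores = [
    [[2.0, 0.1], [0.3, -1.0]],  # channel 0: non-vessel
    [[0.5, 1.5], [0.2, 2.0]],   # channel 1: vessel
]
mask = classify_pixels(scores)
print([[CLASSES[k] for k in row] for row in mask])
```

Since softmax preserves the ordering of the scores, the predicted class is simply the channel with the larger score; the probabilities matter mainly during training, where they enter the loss.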

Experimental Data and Environment
This research focused on retinal vessel segmentation. The performance of the proposed Vess-Net method was therefore tested on a publicly available dataset, digital retinal images for vessel extraction (DRIVE) [63]. The dataset consists of 40 RGB fundus camera images that are already divided equally into training and testing sets (20 images each). Ground truth images of manually segmented retinal vessels are publicly available with the dataset [63] for testing algorithms and methods. Figure 4 shows examples of images and the corresponding ground truth images for the DRIVE dataset. In our experiment, we used the training and testing images as given in the dataset. To reduce graphics processing unit (GPU) memory usage, the images and labels were resized to 447 × 447 pixels. Vess-Net is a semantic segmentation neural network that takes both the image and its annotation at the same time for training. Moreover, for better training, we artificially increased the amount of data using image data augmentation, as described in Section 5.2.
Vess-Net was trained and tested on an Intel® Core™ i7-3770K CPU @ 3.50 GHz (4 cores) with 28 GB of system RAM and an NVIDIA GeForce GTX Titan X GPU with 3072 CUDA cores and 12 GB of graphics memory (NVIDIA, Santa Clara, CA, USA) [64]. In our experiments, the proposed model was designed and trained from scratch on our experimental dataset using MATLAB R2019a [65]. Hence, no pre-trained model such as ResNet, GoogLeNet, Inception, or DenseNet was fine-tuned.

Data Augmentation
To train Vess-Net sufficiently, the training dataset should be large. It is very difficult to train a deep neural network with only 20 training images and obtain reliable segmentation. To train Vess-Net properly and provide distinctive training examples, the 20 images and their corresponding ground truths were augmented by artificially creating additional images. Various image transformations such as horizontal flipping, vertical flipping, and crop-resize with nearest-neighbor interpolation were used, as shown in Figure 5.
These transformations were performed in three stages. In Stage 1, 20 images were generated by a horizontal flip and 20 images were generated by a vertical flip, resulting in 60 images.
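Stage 1 of the augmentation described above can be sketched as follows. This is an illustrative sketch only, assuming images and labels are stored as nested lists of pixel values; the helper names and the toy 2 × 2 "images" are hypothetical, and the same flip is applied to each image and its ground truth so that they stay aligned.

```python
def horizontal_flip(image):
    """Mirror left-right by reversing every row."""
    return [row[::-1] for row in image]

def vertical_flip(image):
    """Mirror top-bottom by reversing the order of the rows."""
    return [row[:] for row in image[::-1]]

def augment_stage1(pairs):
    """pairs: list of (image, label). Returns originals plus both flipped sets."""
    out = list(pairs)
    out += [(horizontal_flip(img), horizontal_flip(lbl)) for img, lbl in pairs]
    out += [(vertical_flip(img), vertical_flip(lbl)) for img, lbl in pairs]
    return out

# 20 toy image/label pairs stand in for the 20 DRIVE training images.
pairs = [([[k, k + 1], [k + 2, k + 3]], [[0, 1], [1, 0]]) for k in range(20)]
augmented = augment_stage1(pairs)
```

The 20 originals plus 20 horizontally and 20 vertically flipped copies give the 60 images stated for Stage 1.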

Vess-Net Training
Vess-Net is a fully convolutional network with dual-stream information flow that allows it to converge faster. This type of connectivity allows the network to be trained without pre-processing of the data. As Vess-Net is our own network design, training was performed from scratch, without any pre-trained weight initialization or fine-tuning. Adam was chosen as the optimizer because it is a more sophisticated variant of stochastic gradient descent (SGD); both are first-order gradient-based methods. Adam is computationally efficient and is invariant to diagonal rescaling of the gradients [66]. In this study, the Adam optimizer used an initial learning rate of 0.0005, which was kept constant during training, with a mini-batch size of seven images. Gradient thresholding based on the global L2 norm was adopted, with an epsilon of 0.000001. As the dual streams strengthen the features, Vess-Net was trained for only 15 epochs, with the images shuffled in each epoch to maintain variety in learning. Cross-entropy loss was used with median-frequency class balancing to mitigate the effect of class imbalance, as described in [62]. Figure 6 shows the training accuracy and loss curves for Vess-Net. The x-axis represents the number of epochs, the left y-axis the training loss, and the right y-axis the training accuracy. The plotted accuracy and loss are computed on mini-batches and shown per epoch. With training for 15 epochs and an initial learning rate of 0.0005, Vess-Net reached a training accuracy of 96% and a training loss of approximately 0.06. Training for more epochs did not further increase the accuracy or reduce the loss. As stated in Section 3, the trained Vess-Net models will be made publicly available via [56] for fair comparison with other studies.
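To make the class-balancing step concrete, the sketch below computes median-frequency weights in their commonly used form (class weight = median of the class frequencies divided by the frequency of that class) and applies them to a per-pixel cross-entropy term. It is an assumption-laden illustration, not the authors' code: the function names are hypothetical, frequencies are computed over a single mask, and the pixel counts are the ones the paper reports for its example mask (19,551 vessel pixels out of 199,809).

```python
import math
from statistics import median

def median_frequency_weights(pixel_counts):
    """Weight each class by median(class frequencies) / its own frequency."""
    total = sum(pixel_counts.values())
    freqs = {cls: n / total for cls, n in pixel_counts.items()}
    med = median(freqs.values())
    return {cls: med / f for cls, f in freqs.items()}

def weighted_cross_entropy(prob_true_class, true_class, weights):
    """Loss for one pixel: class weight times the negative log-likelihood."""
    return -weights[true_class] * math.log(prob_true_class)

counts = {"vessel": 19551, "non_vessel": 180258}
weights = median_frequency_weights(counts)

# The same prediction confidence costs more on the rare vessel class.
loss_vessel = weighted_cross_entropy(0.9, "vessel", weights)
loss_background = weighted_cross_entropy(0.9, "non_vessel", weights)
```

Under these assumptions the vessel class receives a weight above 1 and the non-vessel class a weight below 1, so errors on the minority class contribute more to the loss.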

Vess-Net Testing for Retinal Vessel Segmentation
Vess-Net is trained without pre-processing, and the input image is likewise provided directly to the network in the testing phase. The test image passes through both streams (inner and outer), which strengthen the features using six internal skip paths with non-identity mapping (IRSP-1 to IRSP-6) and four outer skip paths with identity mapping (ORSP-1 to ORSP-4). The output of Vess-Net provides two masks, one for the vessel class and one for the non-vessel class. The last 3 × 3 convolution has two filters, one per class. To evaluate our proposed Vess-Net on the DRIVE dataset and compare it with other methods, we adopted sensitivity (Se), specificity (Sp), accuracy (Acc), and area under the curve (AUC) as evaluation metrics. The formulas for Se, Sp, and Acc are given by Equations (4)-(6):
$$\mathrm{Se} = \frac{tp}{tp + fn} \quad (4)$$
$$\mathrm{Sp} = \frac{tn}{tn + fp} \quad (5)$$
$$\mathrm{Acc} = \frac{tp + tn}{tp + fn + tn + fp} \quad (6)$$
where tp, fn, tn, and fp are the numbers of true positives, false negatives, true negatives, and false positives, respectively. Here, tp is a pixel that is listed as a vessel pixel in the ground truth and predicted as a vessel pixel by our network, whereas fn is a pixel that is listed as a vessel pixel in the ground truth but predicted as a non-vessel pixel by the network. tn is a pixel that is listed as a non-vessel pixel in the ground truth and correctly predicted as a non-vessel pixel by the network, whereas fp is a non-vessel pixel in the ground truth that is predicted as a vessel pixel by our network. Figure 7 presents the visual results of vessel segmentation by Vess-Net on the DRIVE dataset. Figure 7 includes cases of fp (shown in green), fn (shown in red), and tp (shown in blue). As shown in the figure, there were no significant errors or failed segmentations for the test images.
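The pixel-wise definitions of tp, fn, tn, and fp given above translate directly into a short computation of Se, Sp, and Acc. The sketch below is illustrative: the function names and the toy masks are assumptions, masks are taken to be nested lists of 0 (non-vessel) and 1 (vessel), and both classes are assumed to occur in the ground truth so that no denominator is zero.

```python
def confusion_counts(pred, truth):
    """Count tp, fn, tn, fp over two binary masks of equal size."""
    tp = fn = tn = fp = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if t == 1 and p == 1:
                tp += 1   # vessel in ground truth, predicted vessel
            elif t == 1 and p == 0:
                fn += 1   # vessel in ground truth, predicted non-vessel
            elif t == 0 and p == 0:
                tn += 1   # non-vessel in ground truth, predicted non-vessel
            else:
                fp += 1   # non-vessel in ground truth, predicted vessel
    return tp, fn, tn, fp

def se_sp_acc(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from the four counts."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    return se, sp, acc

# Toy 2 x 4 masks, not real data.
truth = [[1, 1, 0, 0], [0, 0, 0, 1]]
pred = [[1, 0, 0, 0], [0, 1, 0, 1]]
tp, fn, tn, fp = confusion_counts(pred, truth)
se, sp, acc = se_sp_acc(tp, fn, tn, fp)
```

For the toy masks this gives tp = 2, fn = 1, tn = 4, and fp = 1, so Se = 2/3, Sp = 0.8, and Acc = 0.75.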

Comparison of Vess-Net with Previous Methods
This section compares Vess-Net with other methods based on the evaluation metrics highlighted in Section 5.4.1. Table 5 compares the results obtained by local feature-based methods and learned feature-based methods with those obtained by Vess-Net for the DRIVE dataset. The results confirm the higher performance of Vess-Net for retinal vessel segmentation compared to existing methods, based on the values of AUC and Acc.

Vessel Segmentation with Other Open Datasets Using Vess-Net
To evaluate the performance of Vess-Net in different situations, this study included experiments with two more open datasets for retinal vessel segmentation: Child Heart Health Study in England (CHASE-DB1) [67] and structured analysis of retina (STARE) [68]. CHASE-DB1 consists of 28 images of 14 schoolchildren captured using a Nidek NM-200-D fundus camera with a 30° field of view. The STARE dataset consists of 20 images captured using a TopCon TRV-50 fundus camera with a 35° field of view. Examples of images from the CHASE-DB1 and STARE datasets are shown in Figure 8a,b, respectively. The manual segmentation masks of the first observer were used as ground truths for both CHASE-DB1 and STARE. In the experiment with CHASE-DB1, half of the images (14 images) were used for training with data augmentation (described in Section 5.2), while the other half (14 images) were used for testing, with two-fold cross validation. The overall performance was computed by averaging the two experimental results. For the STARE dataset, the experiment was repeated 20 times, each time selecting one image for testing and the other 19 images for training (leave-one-out validation) with data augmentation (described in Section 5.2). The overall performance was computed by averaging the 20 experimental results.
Tables 6 and 7 present the comparison of local feature-based methods and learned feature-based methods with the proposed Vess-Net for the CHASE-DB1 and STARE datasets, respectively. The results confirm the higher performance of Vess-Net for retinal vessel segmentation compared to existing methods.
We performed training and testing separately on the three databases to allow fair comparisons with previous studies (based on the same experimental protocol), as shown in Tables 5-7. As shown in these tables, our method outperformed the state-of-the-art methods on the three separate databases. To test the portability of our network, additional experiments were performed in which the model was trained with the images of the DRIVE [63] and CHASE-DB1 [67] datasets and tested independently with all the images of the STARE dataset [68]. Table 8 shows the accuracies of Vess-Net. Comparing the accuracies for the three separate databases (Tables 5-7) with those of Table 8 shows that the degradation in accuracy is very small, which confirms the portability of our network.
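The two validation protocols described for CHASE-DB1 (two-fold) and STARE (leave-one-out) can be sketched as index splits. This is a hedged illustration: the function names are hypothetical, and splitting CHASE-DB1 into its first and second halves by index is an assumption, since the text does not state how the two halves were chosen.

```python
def leave_one_out(n):
    """Return n (train_indices, test_indices) rounds with one test image each."""
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]

def two_fold(n):
    """Split indices into two halves and swap their roles in the second round."""
    half = n // 2
    first, second = list(range(half)), list(range(half, n))
    return [(first, second), (second, first)]

def mean(values):
    """Overall performance is the plain average of the per-round results."""
    return sum(values) / len(values)

stare_rounds = leave_one_out(20)   # 20 rounds: 19 training images, 1 test image
chase_rounds = two_fold(28)        # 2 rounds: 14 training images, 14 test images
```

In each round the training indices would be augmented before training, and the per-round Se, Sp, AUC, and Acc values would be averaged with `mean` to obtain the reported figures.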

Detection of Diabetic or Hypertensive Retinopathy
As described in Section 1, diabetic and hypertensive retinopathy can be detected with the help of accurate vessel segmentation, which is a labor-intensive task for a medical specialist to perform manually [1]. With an accurate segmentation algorithm, two retinal images from consecutive visits can be compared. The proposed method provides an accurate binary segmentation mask with pixel values of "0" and "1". The number of pixels labeled as vessel pixels (marked as "1") can be counted. If the number of vessel pixels is greater than that of the previously registered image, it represents the swelling or creation of new blood vessels (which indicates the presence of diabetic retinopathy [8,9]). If the number of vessel pixels is less than that of the previously registered image, it represents the shrinkage of blood vessels (which indicates the presence of hypertensive retinopathy [10,11]). Figure 11 shows an example binary segmentation mask detected by the proposed method. In this case, the number of vessel pixels is 19,551 out of a total of 199,809 pixels in the image of 447 × 447 pixels. This pixel count can be used as a biomarker to detect both diabetic and hypertensive retinopathy with an appropriate threshold. However, the accuracy of this detection depends entirely on the correct segmentation of vessels. Se, Sp, AUC, and Acc are the evaluation metrics used to judge the correctness of the vessel segmentation, which in turn determines whether the pixel count used for the detection of diabetic or hypertensive retinopathy is correct. The small vessels created as new blood vessels form are equally important to segment, as they can be caused by diabetic retinopathy; the segmentation algorithm should therefore perform well enough to detect even small vessels.
As shown in Figure 11, the ratio of non-vessel pixels (negative data) to vessel pixels (positive data) is 9.2:1 (180,258 vs. 19,551). As shown in Equations (4) and (5), Se is calculated as the number of correctly detected vessel pixels (tp) over the total number of vessel pixels (positive data), which shows the detection accuracy for vessel pixels, while Sp is calculated as the number of correctly detected non-vessel pixels (tn) over the total number of non-vessel pixels (negative data), which shows the detection accuracy for non-vessel pixels. As shown in Tables 5-8, Sp is higher than Se. Nevertheless, in the experiments presented in Tables 5-7, we compared the Se, Sp, and Acc (which shows the detection accuracy for both vessel and non-vessel pixels, as shown in Equation (6)) of our method with those of existing methods, as these metrics have been widely used in previous studies. As shown in these tables, our method outperformed the state-of-the-art methods.
As shown in Figures 7, 9 and 10, our method still produces errors (fn) for small vessels, but these errors can be compensated for with the additional help of a medical expert. For example, our system can be used as the first step of diagnosis, and only the suspicious images need to be checked again by a medical expert in the second step. With this two-step diagnosis scheme, the correct prediction of diabetic and hypertensive retinopathy can be enhanced while lessening the diagnostic burden on the medical expert.
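The pixel-count comparison between visits described in this section can be sketched as follows. This is a minimal illustration rather than a clinical procedure: the function names, the relative `tolerance` threshold of 5%, and the follow-up counts are hypothetical, since the text only says that an appropriate threshold is needed. The baseline count is the 19,551 vessel pixels reported for the 447 × 447 example mask.

```python
def vessel_pixel_count(mask):
    """Count pixels labeled 1 (vessel) in a binary mask."""
    return sum(sum(row) for row in mask)

def compare_visits(prev_count, curr_count, tolerance):
    """Flag a relative change in vessel pixels beyond a hypothetical tolerance."""
    change = (curr_count - prev_count) / prev_count
    if change > tolerance:
        return "increase"   # more vessel pixels than at the previous visit
    if change < -tolerance:
        return "decrease"   # fewer vessel pixels than at the previous visit
    return "stable"

total_pixels = 447 * 447
vessel_pixels = 19551
non_vessel_pixels = total_pixels - vessel_pixels
ratio = non_vessel_pixels / vessel_pixels   # non-vessel to vessel ratio

# Hypothetical follow-up counts compared with the baseline mask.
result_up = compare_visits(vessel_pixels, 21000, tolerance=0.05)
result_down = compare_visits(vessel_pixels, 18000, tolerance=0.05)
result_same = compare_visits(vessel_pixels, 19600, tolerance=0.05)
```

In the two-step scheme described above, images flagged as "increase" or "decrease" would be the suspicious cases passed to a medical expert for review.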

Discussion
In this study, a new technique is introduced to take advantage of the feature re-use policy. To compensate for the spatial information lost during the continuous convolution process and to strengthen the features traveling through the network, two residual streams are used. Stream 1 uses a series-type flow to strengthen the features by importing information from the previous layer using non-identity skip paths, whereas Stream 2 is a direct stream from the encoder to the decoder, so that the edge information at each level has a dedicated identity residual path. To explain the effect of the two streams, Figure 12 presents the feature maps from the decoder at three points. Points P and Q are from DCB-2, as shown in Table 4 and Figure 3. The feature maps are extracted before point P, i.e., before Stream 2 (shown in Figure 12a); after point P, i.e., features strengthened by Stream 2 (shown in Figure 12b); and after point Q, i.e., the combined features of Streams 1 and 2 (shown in Figure 12c). Figure 12b,c clearly show that both streams significantly enhance the features for reliable segmentation of retinal vessels. Note that in Figure 12, the total number of DCB-2 channels is 128, but only the first 32 channels are shown for convenience. The important observations from our network are as follows:
- Vess-Net is empowered by IRSPs and ORSPs, which enable the network to provide high accuracy with few convolution layers (only 16).
- With the provision of direct spatial edge information, the network converges rapidly, i.e., in only 15 epochs (3075 iterations).
- Vess-Net is designed so that the smallest feature map is 27 × 27 (as shown in Table 3), which is sufficient to represent the tiny vessels that are created due to diabetic retinopathy.
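The figures quoted in these observations can be checked with a little arithmetic. The sketch below assumes 2 × 2 pooling with stride 2 and no padding, which is not stated in this section, and uses the four pooling layers described for the encoder; the function name `pooled_sizes` is hypothetical.

```python
def pooled_sizes(size, num_pools, window=2, stride=2):
    """Spatial size after each pooling layer (assumed: no padding, floor division)."""
    sizes = [size]
    for _ in range(num_pools):
        size = (size - window) // stride + 1
        sizes.append(size)
    return sizes

# 447 x 447 input passed through four pooling layers.
sizes = pooled_sizes(447, 4)

# 3075 iterations over 15 epochs, as stated in the observations.
iterations_per_epoch = 3075 // 15
```

Under these assumptions the spatial size goes 447, 223, 111, 55, 27, which agrees with the stated minimum feature map of 27 × 27, and the training run corresponds to 205 iterations per epoch.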

Conclusions
This study proposed a dual-stream feature empowerment network (Vess-Net) for retinal vessel segmentation in non-ideal scenarios. In fundus images, retinal vessels have very low pixel intensities, which makes them similar to the background and difficult to segment. The proposed Vess-Net method has a two-way information flow that differentiates between vessel and non-vessel classes even in the presence of continuous convolutions, which tend to lose spatial information at each stage. In the absence of these residual skip paths, information about tiny vessels would be lost as the gradient vanishes, and those tiny vessels are important for the diagnosis of diabetic retinopathy. To preserve tiny vessels, features from preceding layers are incorporated, which significantly enhances the segmentation process. Moreover, with this design, the network is powerful and can segment minor information with few layers. The direct connection from the encoder to the decoder, which provides edge information, makes the network converge rapidly and substantially reduces the number of trainable parameters while retaining fine segmentation of the vessels, which is important for computing the vessel pixel count used to detect diabetic or hypertensive retinopathy. One of the most important characteristics of the proposed method is that it avoids pre-processing overheads, so that original images can be provided to the network without conventional pre-processing for training and testing.
Vess-Net is creatively supported by inner and outer residual skip paths. Our future goal is to create a similar network with different mapping options that can provide a sufficiently good segmentation performance with fewer trainable parameters. In addition, this network can be used for semantic segmentation in various domains.

Conflicts of Interest:
The authors declare no conflict of interest.