G-Net Light: A Lightweight Modified Google Net for Retinal Vessel Segmentation

Abstract: In recent years, convolutional neural network architectures have become increasingly complex to achieve improved performance on well-known benchmark datasets. In this research, we introduce G-Net light, a lightweight modified GoogleNet with an improved filter count per layer to reduce feature overlaps, hence reducing the complexity. Additionally, by limiting the number of pooling layers in the proposed architecture, we exploit skip connections to minimize spatial information loss. The suggested architecture is analysed using three publicly available datasets for retinal vessel segmentation, namely DRIVE, CHASE and STARE. The proposed G-Net light achieves an average accuracy of 0.9686, 0.9726 and 0.9730 and an F1-score of 0.8202, 0.8048 and 0.8178 on the DRIVE, CHASE and STARE datasets, respectively. The proposed G-Net light achieves state-of-the-art performance and outperforms other lightweight vessel segmentation architectures with fewer trainable parameters.


Introduction
Diabetic retinopathy (DR) has gained a great deal of attention recently due to its connection with long-standing diabetes, which is one of the most common causes of avoidable blindness in the world [1,2]. Additionally, diabetic retinopathy is one of the major contributors to vision loss, especially in people of working age [3,4]. Lesions are the first signs of diabetic retinopathy. They include exudates, microaneurysms, haemorrhages, vessel abnormalities and leakages [5][6][7]. The number and type of lesions that form on the surface of the retina affect the severity and diagnosis of the disease. Thus, the effectiveness of an automated system for extensive screening may depend on the precision of segmenting blood vessels, optic cup/disc and retinal lesions [8]. Along these lines, detecting retinal blood vessels has long been regarded as the most difficult problem and is frequently considered the most crucial part of an automated computer-aided diagnostic (CAD) system [1,9]. This is because the vessels in the retina are hard to delineate owing to their tortuous shape, density, diameter and branching pattern [10]. Further challenges arise from the centerline reflex and the many components that make up the retina, including the macula, optic cup/disc, exudates and so on, all of which may have lesions or other flaws. Finally, the camera calibration settings and the acquisition method can also introduce variability into the imaging process.
When a machine learning or deep learning architecture is used for blood vessel segmentation, training is usually conducted using a dataset of manually labelled segmented images [11]. These techniques have been used to detect retinal vessels in order to diagnose serious disorders including retinal vascular occlusions [12], glaucoma [13], age-related macular degeneration (AMD) [14], DR [15] and chronic systematic hypoxemia (CSH) [16]. Kadri et al. [17] introduced a multi-scale filter (MSMF) utilising the slime mould optimization technique. Furthermore, deep learning-based approaches have attained state-of-the-art accuracy in applications including vessel detection and optic cup/disc detection [18]. Supervised machine learning models are therefore regarded as the predominant technique for building retinal diagnostic systems [19,20]. Despite the considerable success of supervised ML models [21,22], it is still challenging to find blood vessels in the presence of noticeable variations and abnormalities, and significantly more so when the vessel diameter is small. Moreover, although the results of supervised segmentation are superior to those of unsupervised segmentation, training these architectures is time-consuming. The lack of comprehensively labelled data for a variety of ailments and imaging modalities makes this even more challenging.
According to [23,24], complex CNN-based models do not produce optimal results for the majority of segmentation algorithms. Note that the number of hidden layers and the number of filters used in each layer have a significant impact on the number of trainable parameters. In these circumstances, shallow networks are frequently suggested as a substitute for deep networks [20]. In comparison to their deep counterparts, these shallow networks employ fewer filters per layer. Our network's layout is intended to make the most of the filters in each layer while minimizing the complexity of the system as a whole. If an image has little feature variation, adding more filters to a convolution layer does not improve performance, but it does increase complexity [25,26]. In the literature, the complexity of convolutional networks has been lowered by proposing small-scale networks with fewer layers [27][28][29][30][31][32]. However, the trade-off between performance and complexity is not addressed in [27]. Here, the feature complexity is used to determine the number of filters.
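To make the dependence of the trainable-parameter count on depth and filter count concrete, the following minimal sketch counts the weights and biases of a plain stack of 3 × 3 convolutions. The function names and the filter counts (16 and 64 filters per layer) are hypothetical values chosen for illustration; they are not the G-Net light configuration.

```python
def conv_params(in_channels, out_channels, kernel_size=3, bias=True):
    """Trainable parameters of one 2-D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def stack_params(in_channels, widths, kernel_size=3):
    """Total parameters of consecutive convolutions with the given filter counts."""
    total = 0
    for out_channels in widths:
        total += conv_params(in_channels, out_channels, kernel_size)
        in_channels = out_channels  # output of this layer feeds the next one
    return total


# Hypothetical filter counts, chosen only for illustration (RGB input).
narrow = stack_params(3, [16, 16, 16])
wide = stack_params(3, [64, 64, 64])
print(narrow, wide)
```

Because each layer's weight count is the product of its input and output channel counts, widening every layer by a factor of four increases the total in this toy stack by far more than a factor of four, which is the effect the paragraph above refers to.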
To the best of our knowledge, a GoogleNet-based encoder-decoder architecture for image segmentation has not been proposed so far; hence, one of the major contributions of the proposed work is the design of a decoder for GoogleNet. Inspired by GoogleNet [33], this study introduces G-Net light, a simple yet effective small-scale neural network architecture for retinal blood vessel segmentation. G-Net light has only a small number of parameters, which means that it requires less memory and fewer GPU resources than alternatives with significantly more parameters. In addition, the encoder employs only two max-pooling layers to reduce the spatial information loss. Experiments are conducted on three different retinal blood vessel segmentation datasets to demonstrate the efficacy of the proposed architecture for medical image segmentation.

Related Work
Recent research has extended the U-Net structure by changing its module design and network construction, demonstrating its potential on numerous visual analysis tasks. V-Net [34] extends U-Net to higher-dimensional (volumetric) data while retaining the vanilla internal structures. W-Net [35] adapts U-Net to the problem of unsupervised segmentation by concatenating two U-Nets in an autoencoder-style model. In contrast to U-Net, M-Net [36] appends different scales of input features to different levels, allowing a sequence of downsampling and upsampling layers to capture multi-level visual information. U-Net++ [37] has recently adopted nested and dense skip connections to depict fine-grained object information more efficiently. Furthermore, attention U-Net [38] employs extra branches to apply the attention mechanism adaptively to the fusion of skipped and decoded features. However, these proposals may include extra building blocks, resulting in a larger number of network parameters and, as a result, higher GPU memory consumption. It has been established that using recurrent convolution to repeatedly refine the features extracted at different time steps is feasible and successful for many computer vision problems [39][40][41][42]. Guo et al. [39] advocated reusing residual blocks in ResNet to fully utilise the available parameters and greatly reduce the model size. Such a mechanism is also beneficial to the evolution of U-Net. Accordingly, Wang et al. [42] created R-U-Net, which recurrently connects multiple paired encoders and decoders of U-Net to improve its discriminative strength for semantic segmentation; however, as a trade-off, extra learnable blocks are included.

G-Net Light
This section presents and explains the proposed network architecture. The overall architecture of G-Net light is presented in Figure 1. The proposed network starts with an input image layer followed by a convolutional layer, and ends with the final layers that create the pixel-wise segmentation map. Nonlinear activations (ReLU) are applied to the output of the convolutional layer, and the resulting feature maps are fed into a max-pooling layer. An inception block is used after this max-pooling layer, followed by another max-pooling layer. A further inception block connects the encoder and decoder blocks. On the decoder side, an up-sampling (max-unpooling) layer is followed by the same inception block, another up-sampling layer and another inception block. Once the spatial information has been restored by the up-sampling layers, a convolutional layer (CL) followed by nonlinear activations (ReLU) and a batch normalization (BN) layer is applied. After a soft-max layer, the final classification layer is a Dice pixel classification layer. In total, the proposed architecture has four inception blocks: one in the encoder after the first downsampling, one intermediary block that connects the encoder and decoder, and two in the decoder, which are followed by the convolutional layer and the final layers required for constructing the pixel-wise segmentation map. Each encoder block creates its own set of features by convolving its input feature maps with a filter bank, and ReLU activations are applied to these features. Depending on whether the block is down-sampling or up-sampling, the produced feature maps are subsequently supplied to a max-pooling or an unpooling layer. All max-pooling and unpooling layers are 2 × 2, non-overlapping, with a stride of 2.
The proposed network design responds to multiple motivations. To begin, we wanted to use as few pooling layers as possible in the proposed architecture, because pooling reduces the size of the feature maps and can also result in a loss of spatial information. Second, we have used a limited number of convolutional layers. Finally, within each layer, the total number of convolutional filters is minimized. Skip connections have been used between the encoder blocks and the associated decoder blocks to preserve structural information. Figure 1 depicts these as dotted lines with arrowheads. Another motivation for adopting skip connections as an alternative to dense skip paths is the assumption that retaining features within each convolutional layer may help to reduce the semantic gap between the encoder side and the decoder side while keeping the computational overhead under control. The number of pooling layers is reduced in the proposed network in order to preserve fine-grained structures, which are frequently important in medical image segmentation.
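The 2 × 2, stride-2 pooling and unpooling pair described above can be illustrated with a minimal pure-Python sketch. It assumes that max-unpooling places each pooled value back at the position recorded during pooling and fills the remaining positions with zeros, as in index-based unpooling; the text does not spell out this detail, so treat it as an assumption. The function names and the 4 × 4 feature map are hypothetical.

```python
def max_pool_2x2(fmap):
    """2x2 non-overlapping max-pooling (stride 2); also returns argmax positions."""
    rows, cols = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for r in range(0, rows - 1, 2):
        pooled_row, index_row = [], []
        for c in range(0, cols - 1, 2):
            # Collect the four values of the window together with their positions.
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, rows, cols):
    """Put each pooled value back at its recorded position; zeros elsewhere."""
    restored = [[0] * cols for _ in range(rows)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            restored[r][c] = value
    return restored


# Toy 4x4 feature map (illustrative values only).
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, indices = max_pool_2x2(feature_map)
restored = max_unpool_2x2(pooled, indices, 4, 4)
print(pooled)
print(restored)
```

Each pooling step halves both spatial dimensions and keeps only one value per window, which is why the text limits the encoder to two such layers and relies on skip connections to carry the discarded detail to the decoder.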

The Inception Block
The key idea of the inception block is to apply dimension reductions judiciously. These reductions are computed using 1 × 1 convolutions placed before the 3 × 3 and 5 × 5 convolutions. They are dual-purpose because, in addition to acting as reductions, they also include rectified linear activation. Figure 2 depicts the final design of the inception block. An Inception-style network is generally an architecture made up of the above-mentioned modules stacked on top of one another, with intermittent max-pooling layers with stride 2 that reduce the resolution of the grid. For technical reasons during training, it appeared preferable to use inception blocks only at higher layers and to leave the lower layers in typical convolutional form. One of this architecture's key benefits is that it permits a significant increase in the number of units at each stage without an uncontrolled increase in computational complexity. The widespread use of dimension reduction shields the succeeding layer from the large number of input filters of the preceding stage by first lowering their dimension before convolving over them with a large patch size. This method also adheres to the idea that visual data should be processed at various scales and then aggregated, allowing the subsequent stage to extract features from different scales simultaneously.
Because the processing resources are used more efficiently, it is possible to increase both the total number of stages and the width of each stage without encountering computational difficulties. Another way to use the inception block is to develop variants that are less effective but computationally cheaper. All of the available knobs and levers enable a controlled balancing of computational resources. This can lead to architectures that are two to three times faster than similarly performing networks without inception blocks, although this requires careful manual design.
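The saving obtained by placing a 1 × 1 reduction in front of a large convolution can be checked with a short calculation. The sketch below compares a direct 5 × 5 branch with a reduced one; the channel counts (192 input channels, 16 reduced channels, 32 output channels) and the function names are hypothetical and are not taken from the G-Net light configuration. Biases are ignored.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Weights of a convolution; also its multiply-accumulates per output position."""
    return kernel_size * kernel_size * in_channels * out_channels


def branch_cost(in_channels, out_channels, kernel_size, reduce_to=None):
    """Cost of a k x k branch, optionally preceded by a 1 x 1 reduction."""
    if reduce_to is None:
        return conv_weights(in_channels, out_channels, kernel_size)
    reduction = conv_weights(in_channels, reduce_to, 1)
    convolution = conv_weights(reduce_to, out_channels, kernel_size)
    return reduction + convolution


# Hypothetical channel counts, for illustration only.
direct = branch_cost(192, 32, 5)
reduced = branch_cost(192, 32, 5, reduce_to=16)
print(direct, reduced)
```

With these illustrative numbers, the reduced branch needs roughly a tenth of the weights of the direct one, which is the budget that allows a block to be made wider without a matching growth in computation.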

Datasets
For the segmentation of retinal vessels, we tested our proposed network using three public image datasets: DRIVE [43], CHASE [44] and STARE [45]. DRIVE [43] is made up of 20 colour images for testing and 20 colour images for training, all stored in JPEG format with an image size of 584 × 565 pixels, and covers a wide age range of DR patients. A field of view (FOV) binary mask is available for all images. Both the test and training images come with manually segmented ground truth vessel labels.
The CHASE [44] dataset includes 28 colour images acquired with a 30° FOV centred at the optic disc and an image resolution of 999 × 960 pixels. Two distinct manual segmentation ground truth maps are available. For the experiments, the first expert's segmentation map is used.
The STARE [45] dataset consists of 20 colour retinal fundus images with a size of 700 × 605 pixels per image that were taken at a 35° FOV. Each of these images has two separate manual segmentation maps available. Here, we have used the first ophthalmologist's segmentation as the benchmark.

Implementation and Training
All of our experiments were run on a GeForce GTX2080TI GPU and an Intel(R) Xeon(R) W-2133 3.6 GHz CPU with 96 GB RAM. Our implementation uses stochastic gradient descent with a fixed learning rate. A weighted cross-entropy loss is employed as the objective function for training in all of our experiments. This choice was made because, in each retinal image, the non-vessel pixels outnumber the vessel pixels by a significant margin. Various techniques can be employed to assign the loss weights. Here, we use median frequency balancing to determine the class weights [46]. Note that the STARE and CHASE datasets do not have a specified test set. In the literature, a "leave-one-out" strategy is frequently utilised for STARE [47]. In this paper, we have employed a "leave-one-out" data split with 10 images for training and 10 for testing. Since the retinal vessel segmentation datasets used are relatively small, we have also employed data augmentation to generate sufficient images for training. Rotations and contrast enhancement were utilised for the data augmentation. For the rotations, each image is rotated by 1° at the training stage. For the contrast enhancement, the image brightness was randomly increased and decreased.
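As a rough illustration of how class weights can be derived for the weighted cross-entropy loss, the sketch below uses one common formulation of median frequency balancing: the frequency of a class is its pixel count divided by the total pixel count of the images in which it appears, and its weight is the median of all class frequencies divided by its own frequency. Whether [46] and the authors' implementation use exactly this formulation is an assumption, as is the plain per-pixel mean used to reduce the loss. The function names and the toy masks (0 = background, 1 = vessel) are hypothetical.

```python
import math
from statistics import median


def median_frequency_weights(label_maps, num_classes=2):
    """Class weights w_c = median(freq) / freq_c (assumes every class occurs)."""
    class_pixels = [0] * num_classes   # pixels belonging to class c
    image_pixels = [0] * num_classes   # pixels of images that contain class c
    for labels in label_maps:
        flat = [pixel for row in labels for pixel in row]
        for c in range(num_classes):
            count = flat.count(c)
            if count:
                class_pixels[c] += count
                image_pixels[c] += len(flat)
    freqs = [class_pixels[c] / image_pixels[c] for c in range(num_classes)]
    med = median(freqs)
    return [med / f for f in freqs]


def weighted_cross_entropy(probabilities, targets, weights):
    """Mean over pixels of -w[y] * log p[y] (one possible normalization)."""
    losses = [-weights[y] * math.log(p[y]) for p, y in zip(probabilities, targets)]
    return sum(losses) / len(losses)


# Two toy label maps in which vessel pixels are the minority class.
masks = [
    [[0, 0, 0, 1], [0, 0, 1, 0]],
    [[0, 0, 0, 0], [0, 1, 0, 0]],
]
weights = median_frequency_weights(masks)
print(weights)
```

In this toy example the rare vessel class receives a weight above 1 and the dominant background class a weight below 1, so errors on vessel pixels contribute more to the loss.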

Evaluation Criteria
Pixel labels in blood vessel segmentation are binary, indicating whether a pixel belongs to a vessel or to the background. Publicly accessible datasets include ground truth that is manually annotated by experienced clinicians. Each pixel is therefore categorized either as a vessel pixel (the area of interest) or as a background pixel. There are four possible outcomes for each pixel of an output image: a vessel pixel correctly categorized as vessel ($TP$: true positive), a background pixel correctly categorized as background ($TN$: true negative), a background pixel incorrectly categorized as vessel ($FP$: false positive), and a vessel pixel incorrectly categorized as background ($FN$: false negative). Four performance measures built from these components, namely accuracy, sensitivity, specificity and F1-score, are commonly used in the literature to compare approaches:

$$A_{cc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

The accuracy ($A_{cc}$) in Equation (1) denotes the proportion of correctly segmented pixels to all of the pixels in the expertly annotated (labelled) mask. The specificity ($S_p$) and sensitivity ($S_n$) indicate how well the non-vessel and vessel pixels, respectively, are correctly distinguished, and are given in Equations (2) and (3), respectively:

$$S_p = \frac{TN}{TN + FP} \tag{2}$$

$$S_n = \frac{TP}{TP + FN} \tag{3}$$

The F1-score ($F_1$), which is the harmonic mean of $S_n$ and precision, is another way to assess the model's performance and can be calculated using Equation (4):

$$F_1 = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{4}$$
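The four measures can be computed directly from the confusion counts using their standard definitions: accuracy = (TP + TN) / (TP + TN + FP + FN), sensitivity = TP / (TP + FN), specificity = TN / (TN + FP) and F1 = 2TP / (2TP + FP + FN). The following minimal sketch does this for a pair of binary masks; the function names and the toy masks (1 = vessel, 0 = background) are hypothetical, and any restriction of the evaluation to the field of view is left out.

```python
def confusion_counts(predicted, ground_truth):
    """Count TP, TN, FP, FN over two binary masks (1 = vessel, 0 = background)."""
    tp = tn = fp = fn = 0
    for pred_row, true_row in zip(predicted, ground_truth):
        for p, t in zip(pred_row, true_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 0 and t == 0:
                tn += 1
            elif p == 1 and t == 0:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn


def segmentation_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and F1-score from the four counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Toy prediction and ground truth, for illustration only.
prediction = [[1, 0, 0, 1], [0, 1, 0, 0]]
truth = [[1, 0, 0, 0], [0, 1, 1, 0]]
counts = confusion_counts(prediction, truth)
metrics = segmentation_metrics(*counts)
print(counts, metrics)
```

Because vessel pixels are a small minority, accuracy is dominated by the background class, which is why sensitivity and F1-score are reported alongside it.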

Analysis of the Results and Comparisons
This section presents the qualitative and quantitative analysis of the proposed architecture against a number of commonly used alternative methods in retinal image segmentation. Table 1 summarizes the quantitative performance of the proposed G-Net light with respect to the ground truths marked by different observers on the DRIVE, CHASE and STARE datasets. The average performance on each dataset is also shown in Table 1. The qualitative segmentation results for the retinal vessels on the DRIVE dataset are analysed and discussed first. Figure 3 illustrates the analysis of the segmented output. Figure 3a presents noisy test images 3, 4 and 19 from the DRIVE dataset, and the corresponding ground truth images of the 1st observer are given in Figure 3b. Figure 3c,d presents the output of SegNet [48] and U-Net [49], respectively. The segmentation output of the proposed architecture is given in Figure 3e. In the segmentation maps, black and green represent accurately predicted pixels, whereas blue and red represent false negatives and false positives, respectively. It can be clearly observed that the visual performance of the proposed G-Net light is better than that of SegNet [48] and U-Net [49].
The vessel segmentation maps of the proposed architecture on the CHASE dataset are given in Figure 4. In the segmentation maps, black and green represent accurately predicted pixels, whereas blue and red represent false negatives and false positives, respectively. The 1st row of Figure 4 shows noisy images from the CHASE dataset, the 2nd row shows the corresponding ground truth images marked by the 1st observer, and the 3rd row shows the vessel segmentation maps produced by the proposed architecture.

Figure 4. Analysis of segmented output on the CHASE dataset. Row one: noisy test images; row two: corresponding ground truth images marked by the 1st observer; row three: output of the proposed network.

Tables 2 and 3 compare the performance of the G-Net light network to some state-of-the-art supervised approaches. The proposed architecture obtains an average sensitivity of 81.92% on the DRIVE database and 82.10% on the CHASE database. In terms of sensitivity, the proposed G-Net light architecture outperforms all other techniques on the DRIVE dataset and is the 3rd highest on the CHASE dataset. The average accuracies of the proposed G-Net light are 96.86% and 97.26%, the highest on the DRIVE and CHASE datasets, respectively. The proposed architecture achieves an average specificity of 98.29% on DRIVE and 98.38% on CHASE, the 3rd and 2nd highest, respectively. Finally, the proposed network achieves an F1-score of 82.02%, the highest on the DRIVE dataset, and the 3rd highest value of 80.48% on the CHASE dataset.

Table 2. Comparison results on the DRIVE dataset (columns: Method, Year, Sn, Sp, Acc, F1-score). Red is the best, green is the 2nd best, and blue is the 3rd best.

Figure 5 illustrates the analysis of the segmented output on the STARE dataset. Figure 5a presents noisy test images from the STARE dataset, and the corresponding ground truth images marked by Adam Hoover are given in Figure 5b. Figure 5c,d presents the output of SegNet [48] and U-Net [49], respectively. The vessel segmentation maps of the proposed architecture are shown in Figure 5e. In the segmentation maps, black and green represent accurately predicted pixels, whereas blue and red represent false negatives and false positives, respectively. It can be clearly observed that the visual performance of the proposed G-Net light is better than that of SegNet [48] and U-Net [49]. Table 4 compares the performance of the proposed network to state-of-the-art supervised approaches. The proposed architecture obtains an average sensitivity of 81.70%, which is the 2nd highest among all methods. The average accuracy of the proposed G-Net light architecture is 97.30%, which is the 3rd highest. Finally, the proposed network achieves an F1-score of 81.78%, the 2nd highest among all methods on the STARE dataset.

Table 5 compares the proposed G-Net light with current lightweight networks for retinal vessel segmentation in terms of learnable parameters and quantitative performance. The accuracy and F1-score of G-Net light are compared with those of current lightweight networks on the DRIVE, CHASE and STARE datasets. Table 5 shows that G-Net light outperforms the state-of-the-art alternatives in terms of accuracy and F1-score with a minimal number of learnable parameters. Figure 6 illustrates the analysis of the quantitative results. Figure 6a compares the quantitative results of G-Net light with those of other methods on the DRIVE dataset. Figure 6b,c shows the same comparison on the CHASE and STARE datasets, respectively. It can be observed from Figure 6 that the performance of the proposed G-Net light is clearly comparable with that of the other state-of-the-art methods.

Discussion
A sizeable portion of the global population is affected by a variety of retinal illnesses that can compromise vision. This concern arises in part from the high cost of the equipment required for the diagnosis of ophthalmological diseases, and in part from the scarcity of readily available ophthalmological specialists. A prompt diagnosis of these retinal illnesses is essential in order to avert vision loss and blindness. In this regard, accessible computer-aided diagnostic techniques have the potential to play a pivotal role. The majority of the deep learning models that have been developed for the diagnosis of retinal disorders function effectively, despite the fact that they are computationally expensive. This constitutes a major obstacle to the deployment of such models on portable edge devices. Therefore, the proposed lightweight model for the segmentation of retinal vessels can play a vital role in the development of computationally less expensive diagnostic systems. The proposed model uses significantly fewer trainable parameters without sacrificing performance.

Conclusions
In this research paper, we have introduced and analysed G-Net light, a lightweight modified GoogleNet with an improved filter count per layer to reduce feature overlaps and complexity. Additionally, by reducing the number of pooling layers in the proposed architecture, we have exploited skip connections to minimize spatial information loss. The proposed architecture is evaluated on the publicly available DRIVE, CHASE and STARE datasets. In the experiments, the proposed G-Net light achieves state-of-the-art performance and outperforms other lightweight vessel segmentation architectures in terms of accuracy and F1-score with fewer trainable parameters.