An End-to-End Real-Time Lightweight Network for the Joint Segmentation of Optic Disc and Optic Cup on Fundus Images

Abstract: Glaucoma is the second leading cause of blindness in the world, and accurate segmentation of the optic disc (OD) and optic cup (OC) is essential for its diagnosis. To address the poor real-time performance, high algorithmic complexity, and large memory consumption of existing fundus segmentation algorithms, a lightweight segmentation algorithm based on convolutional neural networks, GlauNet, is proposed. The algorithm combines an efficient feature-extraction network with a multiscale boundary fusion (MBF) module, which greatly improves segmentation efficiency while preserving segmentation accuracy. Experiments show that the algorithm achieves OD/OC Dice scores of 0.9701/0.8959, 0.9650/0.8621, and 0.9594/0.8795 on three publicly available datasets (Drishti-GS, RIM-ONE-r3, and REFUGE-train, respectively). The model has only 0.8 M parameters and takes only 13 ms to infer an 800 × 800 fundus image on a GTX 3070 GPU.


Introduction
Glaucoma is a chronic eye disease that causes irreversible damage to vision. Patients with glaucoma suffer damage to the optic nerve due to increased intraocular pressure (IOP) caused by an imbalance between fluid production and drainage in the eye. The vertical cup-to-disc ratio (CDR) is one of the commonly used indicators for clinical screening of glaucoma. Usually, a CDR greater than 0.65 [1] is diagnosed as glaucoma. Figure 1 shows sample fundus images of a normal eye and a glaucomatous eye together with the corresponding annotation maps of the OD and OC. Accurate segmentation of the OD and the OC is essential for accurate CDR acquisition. At present, the clinical diagnosis of glaucoma is mainly made manually by ophthalmologists, which is somewhat subjective: results vary greatly between doctors, and the process is inefficient. With the rapid development of information technology, medically assisted diagnostic techniques have advanced quickly, making large-scale glaucoma screening possible.
Semantic segmentation is a fundamental task of computer vision. With the great success of deep learning in the field of computer vision, algorithms based on deep learning have improved in efficiency and accuracy compared with traditional machine learning algorithms, providing new ideas for the development of medical image-assisted diagnosis technology. In recent years, segmentation algorithms for OD and OC based on deep learning have emerged one after another. The following are some excellent algorithms based on convolutional neural networks.
The authors of [2] proposed an attention U-Net fundus-image-segmentation algorithm based on transfer learning. The algorithm proposes an attention gate module which is used to focus on the target area. To acquire the pretraining weights, the model is first trained on the DRIONS-DB dataset to obtain a set of pretraining weights; these weights are then trained further on the Drishti-GS dataset. Finally, the fundus images are segmented using the trained attention U-Net model combined with transfer learning. The authors of [3] proposed a segmentation network called BGA-Net, which is combined with adversarial learning to obtain a set of optimal model weights through alternate training to better segment OD and OC. The authors of [4] proposed an unsupervised domain-adaptation network called BEAL, which suppresses the geometric structure of the boundary while generating more realistic boundaries through adversarial learning. This method effectively reduces the formation of sawtooth artifacts on the segmentation boundary and improves the segmentation accuracy of OD and OC. The authors of [5] proposed a two-stage approach that first locates the OD and then jointly segments the OD and OC according to the region of interest. The method uses depthwise-separable convolution to improve the segmentation efficiency of the network model and adds a multiscale image pyramid to improve the accuracy and robustness of OD and OC segmentation. In [6], an unsupervised adaptive segmentation method for OD and OC is proposed. The method uses the image synthesis mechanism of GAN for feature alignment of the output image, while an edge-attention module (EAM) is introduced to enhance the representation of boundary information. The method outperforms other unsupervised methods and is particularly advantageous on small datasets.
The above deep-learning-based methods have achieved excellent segmentation performance in OD and OC segmentation tasks, but also have the following problems: (1) They use classical networks (such as VGG [7], ResNet [8], DeepLab [9][10][11][12]) as the feature-extraction networks of the segmentation models. While such networks have good semantic segmentation performance, this is accompanied by tens of millions of network parameters. (2) The algorithm design is often complex, the computational complexity is high, and the inference time is long; so, it cannot meet the needs of large-scale glaucoma screening. (3) The algorithm itself has high requirements for computing power and device memory, making it difficult to deploy and apply on mobile devices. In response to the above problems, this paper is devoted to designing a lightweight and efficient OD and OC segmentation algorithm, balancing the performance of the algorithm, model size, and inference speed so that the algorithm can meet the requirements of mobile devices.
Real-time semantic segmentation has now become an important topic in edge computing, and a large number of excellent algorithms have been proposed. Inspired by real-time segmentation networks [13][14][15][16], this algorithm designs a simple and lightweight feature-extraction network using a small amount of ordinary convolution, depthwise-separable (DS) convolution, and asymmetric convolution (AC) for the extraction of spatial detail and contextual information, and a multiscale boundary fusion (MBF) [17][18][19] module to capture the OD and OC boundaries. In summary, the main contributions of this paper are as follows:
(1) An end-to-end lightweight and efficient network model for OD and OC segmentation, GlauNet, is proposed, which significantly reduces the number of model parameters and computational complexity, and achieves competitive OD and OC segmentation results without the need for pretrained weights.
(2) A multiscale boundary fusion (MBF) module is designed according to the characteristics of fundus images, including a multiscale feature fusion (MFF) branch and a boundary feature auxiliary (BFA) branch. This module fuses the multiscale feature map obtained by the MFF branch and the boundary feature map obtained by the BFA branch, which improves the segmentation accuracy of the optic disc and the optic cup and the robustness of the segmentation algorithm.
(3) GlauNet is designed for deployment and application on mobile devices. The model has only 0.8 M parameters, and it only takes 13 ms to process an 800 × 800 fundus image on a GTX 3070 GPU.
The rest of this paper is organized as follows: Section 2 presents the segmentation model; Section 3 presents the experimental details and experimental results; Section 4 presents the ablation experiments; Section 5 discusses the algorithm and experimental results; Section 6 presents the summary of the paper.

Methods
The GlauNet segmentation model is mainly composed of three modules: the spatial-detail information-extraction module (A); the context-information-extraction module (B); and the decoding head module (C). The network structure is shown in Figure 2. The spatial-detail information-extraction module consists of three standard convolutional layers and an MBF module, where the blue dotted box in module A represents the MBF module. To reduce the loss of spatial detail during convolution, the number of channels is increased from one convolutional layer to the next. As the spatial information in fundus images is relatively homogeneous, using a large number of convolutional channels yields little or even negative improvement in feature-extraction ability. Therefore, to keep the spatial-detail information-extraction module lightweight and efficient, the three standard convolutional layers have 32, 48, and 64 channels in turn; each layer has a kernel size of 3 and a stride of 2 and is followed by a batch-normalization [20] layer and a ReLU activation function. To obtain richer semantic information, the MBF module is used to extract multiscale information and the boundary information of the OD and OC. The feature map output from the spatial-detail information-extraction module is 1/8 of the size of the input image. The specific structure is shown as module A in Figure 2.

Context-Information-Extraction Module
The context-information-extraction module consists of three blocks and an MBF module, where the blue dotted box in module B represents the MBF module. Each block consists of two bottleneck inverted residual structures, and the function of the block is to efficiently extract the contextual information of the fundus image. To reduce model parameters and computation, this module uses depthwise-separable (DS) convolutions instead of standard convolutions. DS convolution consists of depthwise (DW) convolution and pointwise (PW) convolution. With 3 × 3 kernels, standard convolution is theoretically 8-9 times more costly than DS convolution in terms of both the number of parameters and the amount of computation. The ratios of the parameter cost and the computational cost of standard convolution to those of DS convolution are given in Equation (1), where the numerator corresponds to standard convolution and the denominator to DS convolution:
$$C_{pc} = \frac{F \times F \times M \times N}{F \times F \times M + M \times N}, \qquad C_{cc} = \frac{F \times F \times M \times N \times H \times W}{F \times F \times M \times H \times W + M \times N \times H \times W} \tag{1}$$
where F is the size of the convolution kernel, M is the number of input channels, N is the number of output channels, and H and W are the height and width of the input image, respectively. C_cc is the ratio of the computational costs and C_pc is the ratio of the parameter costs. Both ratios simplify to F²N/(F² + N), which approaches F² = 9 for a 3 × 3 kernel as N grows.
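The cost ratio between standard and DS convolution can be checked numerically. The following is a minimal sketch, assuming bias terms are ignored and that the DS convolution is one F × F depthwise convolution followed by one 1 × 1 pointwise convolution; the function names are hypothetical and are not taken from the GlauNet code.

```python
def standard_conv_params(f, m, n):
    """Weights of a standard f x f convolution, m input and n output channels (no bias)."""
    return f * f * m * n


def ds_conv_params(f, m, n):
    """Weights of a DS convolution: depthwise (f*f*m) plus pointwise (m*n), no bias."""
    return f * f * m + m * n


def cost_ratio(f, m, n):
    """Standard / DS ratio. The H*W factor of the computational cost cancels,
    so the same value applies to parameters and to multiply-accumulates."""
    return standard_conv_params(f, m, n) / ds_conv_params(f, m, n)


if __name__ == "__main__":
    # Output channel counts of the three context blocks, with 3 x 3 kernels.
    for channels in (64, 96, 128):
        print(channels, round(cost_ratio(3, channels, channels), 2))
```

For a 3 × 3 kernel the ratio stays below 9 and moves toward it as the number of output channels increases, which matches the 8-9 times figure quoted in the text.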
Specifically, the efficient bottleneck inverted residual structures from MobileNet-v2 [21] are used. Each block contains two bottleneck inverted residual structures, and the numbers of channels in the output feature maps of the three blocks are 64, 96, and 128, in that order. In the first two blocks, the first bottleneck inverted residual structure has a stride of 2, and the remaining bottleneck inverted residual structures have a stride of 1. Then, an MBF module is connected to obtain a contextual feature map of size 16 × 16. To obtain more spatial detail information, the contextual feature map is first upsampled by a factor of four; next, the number of channels of the spatial detail feature map obtained from the spatial-detail information-extraction module is increased to match that of the contextual feature map (65→129); finally, the upsampled contextual feature map and the spatial detail feature map are fused. The specific structure is shown as module B in Figure 2.

Decoder Header Module
The decoding head module consists of two DS convolutions, a boundary feature auxiliary branch, and a standard convolution. The feature map fused from the outputs of the context-information-extraction module and the spatial-detail information-extraction module is sent to the decoding head module. In the decoding head module, a DS convolution is first performed, and the resulting feature maps are then sent to a boundary feature auxiliary branch and a DS convolution branch, respectively. The output feature maps of the two branches are then concatenated. The resulting feature map is classified pixel by pixel using a standard convolution to obtain a segmentation map with a size of 64 × 64; the numbers of output channels of these stages are 129, 130, and 2. Finally, the segmentation map is upsampled by a factor of eight to obtain the segmentation map of the OD and OC with the same size as the input image. All convolution operations in this module have a stride of 1. The specific structure is shown as module C in Figure 2.
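The spatial sizes quoted for the three modules can be traced with simple arithmetic. The sketch below is illustrative only: it assumes 3 × 3 kernels with a padding of 1 for every stride-2 layer and treats "upsampled four/eight times" as scale factors of 4 and 8. The function and key names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_glaunet_sizes(input_size=512):
    """Follow the feature-map side length through modules A, B, and C."""
    sizes = {"input": input_size}

    # Module A: three standard 3 x 3 convolutions, each with stride 2.
    side = input_size
    for _ in range(3):
        side = conv_out(side, kernel=3, stride=2, padding=1)
    sizes["spatial_detail"] = side

    # Module B: the first structure of the first two blocks has stride 2.
    context = side
    for _ in range(2):
        context = conv_out(context, kernel=3, stride=2, padding=1)
    sizes["context"] = context

    # Context map is upsampled by 4 before fusion with the spatial-detail map.
    sizes["context_upsampled"] = context * 4

    # Module C: all convolutions use stride 1, so the size is unchanged.
    sizes["decoder_logits"] = sizes["context_upsampled"]

    # Final upsampling by 8 restores the input resolution.
    sizes["output"] = sizes["decoder_logits"] * 8
    return sizes


if __name__ == "__main__":
    print(trace_glaunet_sizes(512))
```

With a 512 × 512 input this reproduces the 1/8-scale spatial-detail map, the 16 × 16 contextual map, and the 64 × 64 decoder output described in the text.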

Multiscale Boundary-Fusion Module
According to the features of fundus images, the MBF module is designed to better extract semantic information and boundary information. The module includes a boundary feature auxiliary (BFA) branch and a multiscale feature fusion (MFF) branch. The BFA branch is used to extract the boundary features of the input feature map, i.e., the boundaries of the OD and the OC. In this paper, a standard 1 × 1 convolution with a stride of 1 is used for the boundary extraction of the OD and the OC. The BFA branch is computationally trivial but very effective in extracting the boundaries of the OD and the OC, as demonstrated in the ablation experiments section. The MFF branch includes multiple feature-extraction branches with different scales; each branch performs feature extraction through AC with a different expansion coefficient, and the feature maps of the different branches are then fused. Finally, the obtained boundary information map and multiscale feature fusion map are concatenated to obtain richer semantic information.
The specific structure of the MBF in the blue dashed box in modules A and B in Figure 2 is shown in Figure 3, including four multiscale feature-extraction branches and one boundary feature auxiliary branch. Each multiscale feature-extraction branch consists of three asymmetric convolutions [17], and each AC consists of an n × 1 and a 1 × n convolution. Here, we decompose the 3 × 3 convolution into a 3 × 1 convolution and a 1 × 3 convolution. Compared with a 3 × 3 convolution operation, using AC saves about 33% of the number of parameters for the same number of convolution kernels. The theoretical analysis is shown in Equation (2). The fusion of multiscale features is performed by setting different expansion coefficients. The size of the convolution kernels of the four multiscale feature-extraction branches is 3 × 3, and the size of the receptive field of the convolution kernels with different expansion coefficients is shown in Equation (3). For the MBF module in the spatial-detail information-extraction module, the expansion coefficients of the four branches are [1,1,2,3] in sequence. For the MBF module in the context-information-extraction module, the expansion coefficients of the four branches are [1,2,3,5] in sequence. At the same time, to reduce the feature loss in the convolution process, for each multiscale branch, the output feature maps of the three AC operations are concatenated as the output feature map of the branch.
$$P = \frac{n \times 1 + 1 \times n}{n \times n} = \frac{2}{n} \tag{2}$$
where P is the ratio of the number of parameters of the AC operation to that of the standard convolution operation, and n is the size of the convolution kernel. As the size of the convolution kernel increases, the proportion of parameters saved by the AC operation continues to increase.
$$R = n + (n - 1)(c - 1) \tag{3}$$
where c is the expansion coefficient, n is the size of the convolution kernel, and R is the size of the receptive field of the convolution kernel.
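The parameter saving of AC and the receptive fields implied by the two sets of expansion coefficients can be worked out directly. This is a minimal sketch, assuming the usual dilated-convolution relation R = n + (n − 1)(c − 1) for a single kernel; it does not account for the three stacked ACs in each branch, which enlarge the receptive field further. Function names are hypothetical.

```python
def ac_param_ratio(n):
    """Parameters of an (n x 1, 1 x n) pair relative to one n x n kernel."""
    return (n * 1 + 1 * n) / (n * n)


def dilated_receptive_field(n, c):
    """Extent covered by one n-tap kernel with expansion coefficient c."""
    return n + (n - 1) * (c - 1)


# Expansion coefficients quoted for the two MBF modules.
SPATIAL_COEFFS = (1, 1, 2, 3)
CONTEXT_COEFFS = (1, 2, 3, 5)

spatial_fields = [dilated_receptive_field(3, c) for c in SPATIAL_COEFFS]
context_fields = [dilated_receptive_field(3, c) for c in CONTEXT_COEFFS]

if __name__ == "__main__":
    print("saving for n=3:", 1 - ac_param_ratio(3))
    print("spatial-detail MBF:", spatial_fields)
    print("context MBF:", context_fields)
```

Under these assumptions the context MBF covers wider neighbourhoods than the spatial-detail MBF, which fits its role of gathering contextual information on the smaller 16 × 16 feature map.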

Loss Function
This paper uses the binary cross-entropy function as the loss function of the segmentation algorithm. In the definition of the loss, σ(z) is the sigmoid function, N is the number of pixels, y_i^m is the annotation map, and p_i^m is the prediction map of GlauNet.
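For reference, the standard per-channel form of binary cross-entropy is −(1/N) Σ_i [y_i log σ(p_i) + (1 − y_i) log(1 − σ(p_i))]. The sketch below implements that standard form for one channel; it assumes the prediction map holds raw scores that are passed through the sigmoid, and that the superscript m indexes the output channel (OD or OC), neither of which is confirmed by the text. The function names and the `eps` clamp are illustrative choices.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logits(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over a flat list of pixels.

    logits  -- raw network outputs for one channel (assumption)
    targets -- 0/1 annotation values for the same pixels
    eps     -- clamp that keeps log() away from zero
    """
    total = 0.0
    for z, y in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


if __name__ == "__main__":
    # Two confident, correct pixels and one uncertain pixel.
    print(bce_with_logits([4.0, -4.0, 0.0], [1, 0, 1]))
```

A two-channel loss would average this quantity over the OD and OC channels; whether the original implementation sums or averages across channels is not stated.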

Datasets
Drishti-GS [22]: The Drishti-GS dataset consists of 101 fundus images with a resolution of 2047 × 1759. The dataset consists of 31 normal fundus images and 70 glaucoma fundus images, with 50 images in the training set and 51 in the testing set. Each image was manually annotated by four ophthalmologists with different levels of clinical experience.
RIM-ONE-r3 [23]: The RIM-ONE-r3 dataset consists of 159 fundus images with a resolution of 2144 × 1424, including 85 normal fundus and 74 glaucoma fundus images. Among the 159 fundus images, 99 fundus images were used as the training set and 60 fundus images were used as the testing set.
REFUGE [24]: The REFUGE dataset consists of 1200 fundus images, including 1080 normal fundus images and 120 glaucoma fundus images. The dataset consists of a training set, a validation set, and a testing set, each subset containing 400 fundus images. The training set has a resolution of 2124 × 2056 and the validation and testing sets have a resolution of 1634 × 1634.

Implementation Details
The network model was implemented based on the PyTorch 1.10 deep learning framework with CUDA version 11.4, and all experiments were conducted on a single NVIDIA GTX 3070 GPU. The network model is trained using the Adam optimizer, and the momentum is set to 0.9. The network model uses the poly learning-rate strategy for training. The decay strategy of the learning rate is lr = base_lr × (1 − iter/max_iter)^power, where the initial learning rate is set to 0.001 and the power is set to 0.9. Due to the small size of the Drishti-GS and RIM-ONE-r3 datasets, the loss function converged after training for 1000 epochs on these two datasets with a batch size of 12. For the REFUGE dataset, this paper uses the REFUGE-train subset as the experimental dataset. We use 320 fundus images of the REFUGE-train subset as the training set and the remaining 80 as the testing set, and the loss function converges after training for 300 epochs with a batch size of 12.
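The poly learning-rate decay can be written in a few lines. This is a minimal sketch of the formula lr = base_lr × (1 − iter/max_iter)^power with the quoted values as defaults; the function name and the example iteration count are hypothetical, and whether the original code steps the schedule per iteration or per epoch is not stated.

```python
def poly_lr(iteration, max_iter, base_lr=0.001, power=0.9):
    """Poly schedule: decays smoothly from base_lr at iteration 0 to 0 at max_iter."""
    return base_lr * (1.0 - iteration / max_iter) ** power


if __name__ == "__main__":
    max_iter = 1000  # illustrative value only
    for it in (0, 250, 500, 750, 999):
        print(it, poly_lr(it, max_iter))
```

In a PyTorch training loop the returned value would typically be written into each entry of `optimizer.param_groups` before the optimizer step.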
For the preprocessing of fundus images, the region of interest was cropped according to the literature [25], and the final input image size was 512 × 512. An experimental comparison showed that cropping the region of interest not only reduces the computational effort of the network model but also considerably improves the segmentation results. Because fundus images are relatively difficult to acquire, the currently available public fundus datasets are relatively small, so we used several data-augmentation methods to increase the diversity of the data: random scaling, rotation, flipping, elastic transformation, contrast adjustment, noise addition, and random erasing. To optimize the output, we also apply morphological operations (erosion and hole filling) to the resulting segmentation maps to make their boundaries smoother and more natural; the principles of these operations are given in Equations (6) and (7).
In Equation (6), A represents the target of post-processing, B is the structuring element, ⊖ denotes the erosion of A by B, and z is the translation vector.
In Equation (7), X represents the set of all filled holes, B is the structuring element, A^c is the complement of A, ⊕ denotes the dilation used in the hole-filling operation, and k is the number of iterations.
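The two post-processing operations can be illustrated on a small binary grid. The sketch below uses the textbook definitions, erosion as A ⊖ B = {z | (B)_z ⊆ A} and hole filling as the iteration X_k = (X_{k−1} ⊕ B) ∩ A^c started from a seed pixel inside the hole; these forms are assumed here, since Equations (6) and (7) are not reproduced in the text. The structuring elements (a 3 × 3 square for erosion, a 4-connected cross for filling), the seed argument, and all names are illustrative choices.

```python
# Offsets (row, col) of two common structuring elements.
SQUARE = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def erode(mask, element=SQUARE):
    """Keep a pixel only if the element centred on it lies fully in the foreground.
    Pixels outside the grid count as background."""
    rows, cols = len(mask), len(mask[0])

    def is_on(r, c):
        return 0 <= r < rows and 0 <= c < cols and mask[r][c] == 1

    return [[int(all(is_on(r + dr, c + dc) for dr, dc in element))
             for c in range(cols)] for r in range(rows)]


def fill_hole(mask, seed, element=CROSS):
    """Grow a region from `seed` by repeated dilation, restricted to background
    pixels, until it stops changing; return the mask with that region set to 1."""
    rows, cols = len(mask), len(mask[0])
    filled = {seed}
    while True:
        grown = {(r + dr, c + dc) for r, c in filled for dr, dc in element}
        grown = {(r, c) for r, c in grown
                 if 0 <= r < rows and 0 <= c < cols and mask[r][c] == 0}
        if grown == filled:
            break
        filled = grown
    return [[1 if mask[r][c] == 1 or (r, c) in filled else 0
             for c in range(cols)] for r in range(rows)]


if __name__ == "__main__":
    ring = [[0, 0, 0, 0, 0],
            [0, 1, 1, 1, 0],
            [0, 1, 0, 1, 0],
            [0, 1, 1, 1, 0],
            [0, 0, 0, 0, 0]]
    for row in fill_hole(ring, (2, 2)):
        print(row)
```

In practice a library routine such as a flood fill from the image border is often used instead, because it does not need a seed inside each hole; that is a different technique from the seeded iteration shown here.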

Evaluation Criteria
This paper uses DI (Dice index), Jaccard (IoU), sensitivity (SEN), and CDR as the evaluation criteria for the GlauNet segmentation network. In the definitions of DI, Jaccard, SEN, and CDR, N_TP, N_FP, and N_FN represent the number of true positives, false positives, and false negatives, respectively, and CDR_p and CDR_g represent the vertical cup-to-disc ratio computed from the predicted segmentation map and from the annotated map, respectively. This paper uses the average CDR error δ to evaluate the difference between CDR_p and CDR_g; lower δ values represent better prediction results.
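The evaluation criteria can be sketched with their standard definitions: DI = 2·N_TP / (2·N_TP + N_FP + N_FN), Jaccard = N_TP / (N_TP + N_FP + N_FN), SEN = N_TP / (N_TP + N_FN), vertical CDR as the ratio of the vertical cup diameter to the vertical disc diameter, and δ as the mean absolute difference between CDR_p and CDR_g. These are the commonly used forms and are assumed here; all function names are hypothetical, masks are lists of 0/1 rows, and non-empty masks are assumed so that no denominator is zero.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives pixel by pixel."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    return tp, fp, fn


def dice(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)


def jaccard(tp, fp, fn):
    return tp / (tp + fp + fn)


def sensitivity(tp, fn):
    return tp / (tp + fn)


def vertical_diameter(mask):
    """Number of rows spanned by the foreground (0 for an empty mask)."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    return rows[-1] - rows[0] + 1 if rows else 0


def vertical_cdr(cup_mask, disc_mask):
    return vertical_diameter(cup_mask) / vertical_diameter(disc_mask)


def mean_cdr_error(pred_cdrs, true_cdrs):
    """Average absolute CDR difference over a set of images."""
    return sum(abs(p - g) for p, g in zip(pred_cdrs, true_cdrs)) / len(pred_cdrs)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    truth = [[1, 0], [1, 0]]
    tp, fp, fn = confusion_counts(pred, truth)
    print(dice(tp, fp, fn), jaccard(tp, fp, fn), sensitivity(tp, fn))
```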

Experimental Results
This paper presents a comparative analysis of the experimental results in both quantitative and qualitative terms. For the Drishti-GS and RIM-ONE-r3 datasets, the results of the experimental comparison of the method in this paper with some classical methods and some current advanced methods are shown in Tables 1 and 2. For methods that are not open-source, we took the experimental results from the original papers; for the remaining methods, we conducted experiments under the same conditions. The results show that the proposed method achieves competitive or state-of-the-art experimental results on various evaluation metrics. Figures 4 and 5 present the qualitative comparison with U-Net [26], FCN-8s [27], and BGA-Net [3]. Green represents the OD boundary and blue represents the OC boundary. For the REFUGE-train dataset, this paper reproduces the methods in Table 3 under the same conditions. The results show that our method also achieves competitive or state-of-the-art results on various evaluation metrics. The qualitative experimental results are shown in Figure 6.

Model Performance
The receiver operating characteristic (ROC) curve and its corresponding area under the curve (AUC) [35] are used to assess the performance of the algorithm in detecting glaucoma. In general, the higher the AUC, the higher the diagnostic accuracy of the algorithm and the better its performance. This paper uses the ROC curve to evaluate the segmentation performance of GlauNet. Figure 7 shows the ROC curves and corresponding AUC values for the three datasets: Drishti-GS, RIM-ONE-r3, and REFUGE-train.
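The AUC can be computed without tracing the full ROC curve, using its equivalence to the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below assumes that the predicted vertical CDR is used as the glaucoma score, which the text does not state explicitly; the function name and sample values are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    scores -- one score per image, e.g. a predicted CDR (assumption)
    labels -- 1 for glaucoma, 0 for normal
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Made-up CDR values for illustration only.
    cdrs = [0.80, 0.70, 0.40, 0.30]
    is_glaucoma = [1, 1, 0, 0]
    print(roc_auc(cdrs, is_glaucoma))
```

The pairwise form is quadratic in the number of images, which is acceptable for test sets of the sizes used here; a rank-based implementation would be preferable for large cohorts.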
To better analyze the correlation between the mean CDR error δ and AUC, a line graph of the relationship between δ and AUC for the Drishti-GS, RIM-ONE-r3, and REFUGE-train datasets is given in Figure 8. The results show that AUC is not always strictly negatively correlated with δ; the mean CDR error δ therefore reflects AUC only to a certain extent, but it still has guiding significance. In current glaucoma research, the mean CDR error δ is still used as an important indicator in the diagnosis of glaucoma.
This paper compares model parameters, required memory, computational complexity, and inference time with some advanced methods in this field. To ensure the comparability of the results, the methods in Table 4 were all evaluated under the same experimental conditions. The input image is an 800 × 800 RGB image, and the experimental equipment is an NVIDIA GTX 3070 GPU. The results show that the method outperforms other methods on most evaluation metrics while obtaining competitive OD and OC segmentation results.
This paper analyzes the network model size, FLOPs, inference time, and Jaccard on the REFUGE-train dataset. In the corresponding figures, the area of each circle represents the size of the network model, the FLOPs, and the inference time, respectively.
First, the relationship between the size of the network model and the Jaccard accuracy is analyzed, as shown in Figure 9. Our method achieves competitive results with the smallest network model size. The relationship between FLOPs and Jaccard accuracy is then analyzed, as shown in Figure 10. Our method achieves competitive results with minimal FLOPs. Figure 11 analyzes the relationship between inference time and Jaccard accuracy. The inference time of our method is second only to that of LR-ASPP [28] and is only about half that of the other methods, while achieving competitive segmentation results.

Ablation Experiments
To verify the effectiveness of the multiscale feature fusion (MFF) module and the boundary feature auxiliary (BFA) module, we used the model with the MFF and BFA modules removed as the baseline and performed ablation experiments on the RIM-ONE-r3 dataset. The quantitative and qualitative results are reported in Table 5 and Figure 12, respectively. The results demonstrate the effectiveness of the MFF and BFA modules.
The results of ablation experiments show that the MFF module and BFA module can significantly improve the segmentation effect of the optic cup. Figure 13 shows the entropy map of the OC in baseline, baseline+MFF, and baseline+MFF+BFA (GlauNet). Figure 13 shows that the MFF module can effectively reduce the entropy of the cup boundary prediction map, but there is still some boundary noise. Based on the MFF module, the BFA module further suppresses the boundary noise and highlights the edge structure information of the optic cup. The entropy graph in Figure 13 demonstrates the effectiveness of the MFF module and the BFA module from another perspective.

Discussion
To meet the performance requirements of mobile devices and edge devices, this paper proposes a simple and efficient fundus-image-segmentation algorithm, GlauNet. We experimentally verify the proposed algorithm on the Drishti-GS, RIM-ONE-r3, and REFUGE-train public datasets. The performance of the proposed algorithm is evaluated from two aspects: quantitative analysis and qualitative analysis. In this paper, four indices (DI, Jaccard, sensitivity, and CDR) are used as the evaluation criteria for the segmentation results. Tables 1-3 show the results of quantitative experiments on the Drishti-GS, RIM-ONE-r3, and REFUGE-train datasets, which show that our algorithm achieves competitive or state-of-the-art segmentation results for each evaluation criterion compared with current state-of-the-art algorithms. Figures 4-6 show randomly selected segmentation examples from the Drishti-GS, RIM-ONE-r3, and REFUGE-train datasets, including the region-of-interest images, the corresponding manual annotation maps, and the segmentation maps of U-Net, FCN-8s, BGA-Net, and the algorithm in this paper. The segmentation effect of our proposed algorithm is significantly better than that of U-Net and FCN-8s, with a 51% and 48% improvement in inference time compared with U-Net and FCN-8s, respectively. Our proposed algorithm achieves segmentation results comparable to the manual annotations and to the BGA-Net algorithm, while the inference time is improved by 40% compared with BGA-Net.
In this paper, the ROC curve was used to evaluate the performance of the proposed algorithm in diagnosing glaucoma, and its ROC curve is shown in Figure 7. Our algorithm achieved a diagnostic accuracy of 91.30, 81.22, and 99.61 on the Drishti-GS, RIM-ONE-r3, and REFUGE-train datasets, respectively. To validate the effectiveness of the MFF and BFA modules in this paper, we conducted ablation experiments on the RIM-ONE-r3 dataset. The quantitative results in Table 5 and the qualitative results in Figure 12 show that the MFF and BFA modules bring a significant improvement in segmentation performance. To verify the computational performance of the proposed algorithm, we compare and analyze the performance of the model with some current state-of-the-art methods. The results in Table 4 and Figures 9-11 show that our algorithm is much smaller than some of the current state-of-the-art algorithms in terms of the number of model parameters, computational complexity, and memory, while achieving competitive or state-of-the-art segmentation results, demonstrating the lightweight nature and efficiency of the proposed algorithm.
However, as the algorithm does not use pretrained weights, our training efficiency is low, requiring 1000 epochs for the Drishti-GS and RIM-ONE-r3 datasets and 300 epochs for the REFUGE-train dataset for the algorithm's loss function to reach convergence.

Conclusions
In this paper, we propose a lightweight medical-image-segmentation algorithm, GlauNet, for joint OD and OC segmentation, which consists of a spatial-detail information-extraction module, a context-information-extraction module, and a decoding head module. To obtain richer boundary information, an MBF module is proposed, and experimental comparison shows that it benefits the segmentation results. The algorithm has high real-time performance, low algorithm complexity, and a small memory footprint compared with current state-of-the-art algorithms. We conducted extensive experimental comparisons on the Drishti-GS, RIM-ONE-r3, and REFUGE-train datasets and showed that the GlauNet algorithm achieved competitive or better segmentation results than current state-of-the-art methods on the three fundus datasets with only 0.8 M model parameters. In the future, we will continue to work on lightweight segmentation algorithms and apply the proposed methods to more medical image segmentation tasks.

Data Availability Statement: Publicly available datasets were analyzed in this study. The code can be accessed at https://github.com/liu1037342030/GlauNet (accessed on 1 November 2022). These data can be found at: https://ai.baidu.com/broad/download (accessed on 1 November 2022).

Conflicts of Interest:
The authors declare no conflicts of interest.