Weakly Supervised and Semi-Supervised Semantic Segmentation for Optic Disc of Fundus Image

Weakly supervised and semi-supervised semantic segmentation has been widely used in the field of computer vision, since it requires no groundtruth, or only a small number of groundtruths, for training. Recently, some works have trained models on pseudo groundtruths generated by a classification network; however, this approach is not well suited to medical image segmentation. To tackle this challenging problem, we use the GrabCut method to generate pseudo groundtruths in this paper, then train a network based on a modified U-net model with the generated pseudo groundtruths, and finally fine-tune the model with a small number of groundtruths. Extensive experiments on the challenging RIM-ONE and DRISHTI-GS benchmarks strongly demonstrate the effectiveness of our algorithm, and we obtain state-of-the-art results on both databases.


Introduction
Fundus retinal examination has been widely used in the prevention, diagnosis, and treatment of diabetic retinopathy, glaucoma, senile maculopathy, and other ophthalmic diseases. The fundus image includes main structures such as the optic disc, macula, and vessels, and the characteristic analysis of these structures is the foundation for diagnosing fundus diseases. As one of the important structures of the fundus image, the size and shape of the optic disc are the main auxiliary parameters for judging various ophthalmic diseases, and are often used as an index of retinopathy. The optic disc is also key to detecting other retinal structures. The distance between the optic disc and the macular area is generally fixed, so the central position of the optic disc can be used as prior knowledge to assist the selection of the macular area; the optic disc is the starting point of the fundus blood vessels, so the vessels within it can serve as the starting seed points for blood vessel tracking algorithms; the optic disc can also assist in establishing the fundus image coordinate system to locate other retinal abnormalities, such as microaneurysms, bleeding points, hard exudates, and hyaline verruca. Therefore, retinal optic disc analysis in fundus images is a very important research issue [1][2][3][4][5][6][7][8].
Traditional optic disc segmentation methods can be divided into three categories. (1) Methods based on morphology, which find contour lines; the typical example is the watershed algorithm. Watershed computation is an iterative labeling process that responds well to weak edges, but noise in the image and subtle gray-level changes on the object surface can produce over-segmentation. Walter et al. [9] is representative of this category; the resulting optic disc edge is not smooth and the accuracy is not very high. (2) Methods based on the Hough transform, which mainly target the optic disc outline. The ellipse Hough transform projects two-dimensional points into a five-dimensional parameter space, while a circle, as a special case of an ellipse, requires projection only into a three-dimensional parameter space. Because the optic disc is not a standard circle, the segmentation result of the circle transform is not as accurate as that of the ellipse transform, but since the computational cost of the ellipse transform grows exponentially, the Hough circle transform has become a key method in optic disc segmentation algorithms [10,11]. Its results can be used directly to segment the optic disc, but the accuracy is not high; a common practice is to take the detected circle as the initial contour of the optic disc and let other methods (such as active contour models) further improve the segmentation accuracy. (3) Methods based on the active contour model, which make the contour approach the disc edge automatically. In this class of algorithms, a curve on the image domain evolves toward the edge of the object under the combined action of internal forces related to the curve itself and external forces defined by the image data. The main variants are the parametric active contour model and the geometric active contour model.
The former is an active contour model based on the variational method, which directly expresses the evolution of the curve in parametric form. It constructs a specific energy function for a given model, minimizes it with the variational method, and obtains the partial differential equation of model evolution; the contour automatically stops when it reaches the target boundary because the energy function reaches its minimum. The latter is driven by the geometric characteristics of the contour to move toward the edge of the target, independently of the parametric characteristics of the contour, which avoids the need to repeatedly re-parameterize the curve. The algorithm proposed by Osareh et al. [12] is representative of the former: it automatically initializes the contour by template matching, transforms the color optic disc image into the Lab color space, and removes the blood vessels, thereby obtaining a better segmentation result. The algorithm proposed by Kande et al. [13] is representative of the latter, and also removes blood vessels in the Lab space. This algorithm iteratively searches for the edge position of energy balance, so its computational complexity is relatively high, and the average processing time per image ranges from more than ten seconds to several minutes.
In recent years, deep learning has been widely used in the field of computer vision, benefiting from the rapid development of convolutional neural networks. Deep-learning-based methods have been successfully applied to medical image processing. Sevastopolsky [14] proposed an optic disc segmentation algorithm based on U-net. Tan et al. [15] developed a convolutional neural network to automatically segment the optic disc. All these methods have achieved good results; the key to their success is the strong learning ability of fully supervised deep convolutional neural network (DCNN) models and the availability of large-scale labeled databases. However, the labeling process is time-consuming and expensive, especially for medical data, which requires at least two medical experts for annotation. To tackle this issue, a number of weakly supervised and semi-supervised learning methods have been proposed. Weakly supervised methods train models using only image-level or bounding-box labels, without groundtruth masks, while semi-supervised methods need only a small number of groundtruths to segment images; both can greatly reduce the human and material resources consumed in making labels (to better understand the differences between these label types, we compare them in Figure 1). Kolesnikov et al. [16] designed a loss function for weakly supervised semantic segmentation in which the whole network is trained with image-level labels only. Alexander et al. [17] proposed a maximum-expected-agreement model selection principle to perform segmentation based on image-level labels. All of these methods have achieved good results, which is why weakly and semi-supervised learning is becoming more and more popular.
However, most current weakly supervised and semi-supervised segmentation algorithms target natural images; there are few methods for medical images, especially the optic disc. Therefore, in this paper, we propose a weakly supervised and semi-supervised semantic segmentation algorithm for retinal optic disc images. First, we use the GrabCut method as a baseline to generate pseudo groundtruths, using only bounding-box level labels as input. Then, we use the pseudo groundtruths as groundtruths to train a new model based on a modified U-net. Finally, a small number of groundtruths are used to fine-tune the model trained in the previous step. The experimental results on the RIM-ONE and DRISHTI-GS databases prove the effectiveness and superiority of our algorithm. An overview of the proposed general framework is shown in Figure 2. As shown in the figure, it is composed of three parts: the first part uses the GrabCut method to generate pseudo groundtruths; the second part uses the generated pseudo groundtruths to train the network based on the modified U-net (this part belongs to the weakly supervised learning method); and the third part uses a small number of groundtruths to fine-tune the network (this part belongs to the semi-supervised learning method).
To sum up, the main contributions of this paper are: • We propose a new optic disc segmentation method based on weakly supervised and semi-supervised learning. The whole method uses only bounding-box level labels and a few groundtruths to train the model, whereas fully supervised semantic segmentation requires a lot of manpower to annotate groundtruths; nevertheless, our final experimental results are close to those of fully supervised methods. This is the main contribution of this paper. We summarize the comparison between our method and existing methods in Table 1.
• We crop the original image into dozens of patches to train the network, which increases the amount of data so that the network can learn the features of the optic disc more accurately. We improve the U-net model by reducing the original U-shaped structure and adding a convolutional layer with output dimension 2 at the end, which reduces the amount of computation, makes full use of low-level features during training, and yields better segmentation results.

The GrabCut Baseline
GrabCut [18] is an established technique for estimating an object segment from its bounding box. The user only needs to draw a box around the target; all pixels outside the box are then taken as background. Gaussian mixture models (GMMs) are fitted to the foreground and background and used to obtain a good segmentation. The algorithm proceeds in the following steps: Initialization

• The user provides only T B to initialize the trimap T. Each pixel in the image is assigned an 'opacity' value; these values form the vector a = (a 1 , ..., a n , ..., a N ), where N is the number of pixels in the image, n is the index of a specific pixel, and a n is the corresponding opacity. Initialize a n = 1 for n ∈ T U and a n = 0 for n ∈ T B .

• Initialize the foreground and background GMMs from the sets a n = 1 and a n = 0, respectively.

Iterative minimization
• (1) For every n in T U , assign pixels to GMM components: k n := arg min k n D n (a n , k n , θ, z n ). Here the vector k = {k 1 , ..., k n , ..., k N } assigns each pixel to a GMM component so that the GMMs can be handled in a tractable way; the parameters θ describe the image foreground and background grey-level distributions; z = (z 1 , ..., z n , ..., z N ) is the array of grey values, indexed by n; and D n is one of the components of the Gibbs energy function E.

• (2) Learn the GMM parameters from the data z: θ := arg min θ U(a, k, θ, z), where U is the data term, one of the components of the Gibbs energy function E; the meaning of the other variables is as described above.

• (3) Use min-cut to solve: min a n :n∈T U min k E(a, k, θ, z).

User editing
• Edit: fix some pixels to either a n = 0 or a n = 1, updating the trimap T accordingly; then execute step (3) above once.

• Optionally, execute the entire iterative minimization algorithm again to refine the result.
The overall idea of the GrabCut algorithm is as above; for more details, readers can refer to the original paper [18]. We use this algorithm to generate pseudo groundtruths. Figure 3 shows some segmentation results on the RIM-ONE database.
From Figure 3, we can see that the GrabCut method is able to segment the approximate contour of the optic disc, but it does not handle the details of the disc edge well, because the pixels near the contour of the optic disc are very similar in intensity. Therefore, there is still a lot of room for improvement. In the next section, we train a model based on the improved U-net, using the pseudo groundtruths as groundtruth, to obtain better segmentation results.

Weakly Supervised and Semi-Supervised Optic Disc Segmentation
After we obtain the pseudo groundtruths, we perform the following steps to realize our weakly supervised and semi-supervised optic disc segmentation.

Preprocessing

Deep learning architectures can learn features from raw data effectively, but proper preprocessing usually facilitates segmentation. This paper studies the retinal image based on the green channel, because the green channel of an RGB fundus image shows the highest contrast between the optic disc and the background. Besides, when training neural networks, one of the most popular and effective ways to accelerate training is to normalize the input. We normalize the green channels of the images with the following formula:

I norm = (I G − µ) / σ,

where I G is the original green channel input, and µ and σ represent the mean and standard deviation of the image data, respectively. In general, contrast enhancement can be achieved by stretching the gray values of low-contrast images. We use the CLAHE operator [19] to obtain a locally contrast-enhanced optic disc image. After contrast-limited adaptive histogram equalization, gamma adjustment [20] is used for further processing; gamma adjustment corrects the image and improves its contrast.

Frame structure
The framework uses cropped inputs, which effectively expands the amount of data and allows the model to learn the low-level features of the optic disc more accurately. Figure 4 shows the improved U-net structure used in this paper. The model is trained on patches of the preprocessed full images. The dimension of each patch is 300 × 300, with its central position randomly selected within the whole image. The first 90% of the dataset is used for training and the last 10% for validation.
In the test stage, in order to obtain a complete result map, we stitch the predicted patches back into a full image. For each pixel, the optic disc probability is obtained by averaging the probabilities of all predicted patches covering that pixel. This strategy uses context information to help eliminate unexpected errors caused by a few patches.
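The random patch cropping and probability-averaging stitching strategy can be sketched in NumPy as follows; the image size, patch count, and patch size are reduced here for illustration.

```python
import numpy as np


def random_patches(image, n_patches=50, size=300, rng=None):
    """Crop `n_patches` square patches with randomly chosen positions."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = image.shape[:2]
    patches, corners = [], []
    for _ in range(n_patches):
        y = rng.integers(0, h - size + 1)
        x = rng.integers(0, w - size + 1)
        patches.append(image[y:y + size, x:x + size])
        corners.append((y, x))
    return patches, corners


def stitch(pred_patches, corners, shape, size):
    """Average the probabilities of all patches that cover each pixel."""
    acc = np.zeros(shape, np.float64)
    count = np.zeros(shape, np.float64)
    for p, (y, x) in zip(pred_patches, corners):
        acc[y:y + size, x:x + size] += p
        count[y:y + size, x:x + size] += 1
    return acc / np.maximum(count, 1)  # uncovered pixels stay 0


# Toy run: stitching the (unmodified) patches back reproduces the image
# wherever patches overlap, since each covering patch votes the same value.
img = np.random.rand(100, 100)
patches, corners = random_patches(img, n_patches=20, size=30)
full = stitch(patches, corners, img.shape, 30)
print(full.shape)
```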
The proposed neural network model is derived from the U-net architecture, but we have improved its structure. We simplify U-net by reducing layers, because optic disc features are relatively less diversified compared with those in other databases. Similar to the middle layers of the fully convolutional network (FCN), we decrease the last dimension from 32 to 1, corresponding to the single class of interest in optic disc segmentation, the optic disc. During the training stage, the loss function can be formulated as follows:

L(θ) = −β Σ i∈Y+ log ŷ i − (1 − β) Σ i∈Y− log(1 − ŷ i ),

where y i ∈ {0, 1} is the pseudo groundtruth generated in Section 2.1, ŷ i is the predicted result, θ denotes the network parameters, and the coefficient β balances background and foreground pixels. In this paper, we set β to 0.9; Y + and Y − represent the foreground and background pixel sets, respectively. An overview of the proposed method is shown in Figure 2, and the pseudo code is described in Algorithm 1.
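The class-balanced loss described above can be sketched numerically as follows; this is a plain NumPy illustration of the formula, not the actual Keras training loss used in the experiments.

```python
import numpy as np


def balanced_bce(y_true, y_pred, beta=0.9, eps=1e-7):
    """Class-balanced binary cross-entropy.

    `y_true` holds labels in {0, 1} (the pseudo groundtruth); `y_pred`
    holds predicted foreground probabilities. `beta` weights the
    foreground term (the paper sets beta = 0.9).
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    fg = y_true == 1   # the set Y+
    bg = y_true == 0   # the set Y-
    return (-beta * np.log(y_pred[fg]).sum()
            - (1 - beta) * np.log(1 - y_pred[bg]).sum())


y_true = np.array([1, 1, 0, 0])
y_pred = np.array([0.9, 0.8, 0.2, 0.1])
print(round(balanced_bce(y_true, y_pred), 4))
```

With beta = 0.9, mistakes on foreground (optic disc) pixels are penalized nine times as heavily as mistakes on background pixels, compensating for the disc occupying only a small fraction of each patch.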

Experiments
We experimented with the proposed algorithm and compared it with current state-of-the-art algorithms. The experimental results show that our algorithm is effective.

Datasets
In the experiments of this paper, we use the public datasets RIM-ONE [21] and DRISHTI-GS [22] to evaluate the feasibility and effectiveness of the proposed algorithm. RIM-ONE is an open retinal image database for optic nerve evaluation; it contains 159 images of the optic nerve head, each with manual segmentations by more than two medical experts. The DRISHTI-GS database contains 50 full fundus images, each with an optic disc segmentation map and an optic cup segmentation map.

Implementation Details
Our framework is based on the convolutional neural network architecture U-net, which we first simplified by reducing layers; the modified architecture is shown in Figure 4. Every input image was cropped into 50 patches of 300 × 300 pixels; some examples from the RIM-ONE dataset are shown in Figure 5. The weights were initialized according to Glorot's method [23]. We then trained the network with Adam optimization; the learning rate and batch size were set to 0.001 and 32, respectively. The total number of iterations was 30 epochs, and the RIM-ONE and DRISHTI-GS datasets were split randomly into 80% training and 20% testing sets. At the inference stage, a fully connected conditional random field model [24] was applied to refine the result. The implementation is based on the Keras framework, which performs all computation on GPUs in single-precision arithmetic. We performed all experiments on an NVIDIA GTX TITAN Xp GPU (memory clock rate 1.582 GHz) with 48 GB of memory.

Evaluations
We evaluate the prediction results with intersection over union (IoU), accuracy (Acc), and sensitivity (Sen), defined as follows:

IoU = TP / (TP + FP + FN),
Acc = (TP + TN) / (TP + TN + FP + FN),
Sen = TP / (TP + FN),

where TP (true positive) is the number of positive cases judged to be positive, FN (false negative) the positive cases judged to be negative, TN (true negative) the negative cases judged to be negative, and FP (false positive) the negative cases judged to be positive.
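The three metrics can be computed directly from binary masks; the following is a straightforward NumPy sketch of the TP/FP/TN/FN definitions above.

```python
import numpy as np


def segmentation_metrics(y_true, y_pred):
    """Return (IoU, Acc, Sen) for binary groundtruth and prediction masks."""
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    iou = tp / (tp + fp + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    return iou, acc, sen


# Toy example: 3 foreground and 3 background pixels, two mistakes.
y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1])
iou, acc, sen = segmentation_metrics(y_true, y_pred)
print(round(iou, 3), round(acc, 3), round(sen, 3))
```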

Results and Discussion
In order to prove the effectiveness of our weakly and semi-supervised segmentation method, we carried out verification experiments on the RIM-ONE and DRISHTI-GS databases. We randomly selected only 10% of the groundtruths from the training set to implement our semi-supervised method. The experiments were run under Ubuntu 16.04 with Python 3.6, using the hardware and Keras setup described above. Figures 6 and 7 show some segmentation results of the proposed algorithm. From them, we can see that our weakly supervised and semi-supervised optic disc segmentation algorithm segments the contour of the optic disc very well. Even without groundtruth training (the proposed weakly supervised method) or with only a few groundtruths for training (the proposed semi-supervised method), the proposed method can still segment the optic disc accurately, and the result is very close to the groundtruth. We also show the training loss curves of the two methods in Figures 8 and 9.
Because there are very few algorithms for weakly and semi-supervised segmentation of the optic disc, we compare the proposed algorithm with fully supervised segmentation methods; the comparisons are given in Tables 2 and 3. From them, we can see that without groundtruth training, or with only a few groundtruths for training, our segmentation results are good and exceed some fully supervised algorithms. However, compared with Zahoor's method and Zilly's method, the proposed method still shows a very small gap, due to the lack of groundtruth for training and imperfect segmentation of some details of the optic disc edge; this shows that our weakly and semi-supervised method still has room for progress. At the same time, we used the proposed framework to conduct a fully supervised experiment (using all groundtruths for training); the results show that our fully supervised method exceeds the current state-of-the-art method in terms of the IoU indicator. To sum up, the methods proposed in this paper, whether weakly supervised or fully supervised, achieve results close to or beyond the state of the art. We also compare the prediction time per image. From Tables 2 and 3, we can see that the prediction time of our method is longer than that of Sevastopolsky's method. This is mainly because each input image is cropped into 50 patches of 300 × 300 pixels in our method: Sevastopolsky's method predicts one image in 0.1 s, while our method predicts 50 patches in 4.49 s and then stitches them into a complete image (the resolution of a complete image in this paper is 500 × 500). In fact, the real-time performance of the two methods is almost the same.

Conclusions
This paper addresses the problem of groundtruth annotation in medical image segmentation. We present a new weakly and semi-supervised optic disc segmentation method, which can segment the optic disc without training on groundtruths marked by medical experts, saving considerable manpower in sample labeling. The proposed algorithm achieves good segmentation results on the RIM-ONE and DRISHTI-GS databases, which proves the effectiveness of our method.
Author Contributions: Conception of project, D.C. and Z.L.; execution, D.C. and Z.L.; manuscript preparation, Z.L. and D.C.; writing of the first draft, Z.L. All authors have read and agreed to the published version of the manuscript.