Unsupervised Domain Adaptation with Shape Constraint and Triple Attention for Joint Optic Disc and Cup Segmentation

Glaucoma has become an important cause of blindness. Although it cannot be cured, early treatment can prevent it from getting worse. A reliable way to detect glaucoma is to segment the optic disc and cup and then measure the cup-to-disc ratio (CDR). Many deep neural network models have been developed to automatically segment the optic disc and the optic cup to aid diagnosis. However, their performance degrades under domain shift. While many domain-adaptation methods have been used to address this problem, they are apt to produce malformed segmentation results. In this study, we propose adjusting the segmentation network using a constrained formulation that embeds domain-invariant prior knowledge about the shape of the segmentation regions. Based on IOSUDA (Input and Output Space Unsupervised Domain Adaptation), we propose a novel unsupervised framework with shape constraints for joint optic cup and disc segmentation, called SCUDA (Shape-Constrained Unsupervised Domain Adaptation). A shape-constrained loss function is proposed that uses domain-invariant prior knowledge about the optic cup and optic disc regions of fundus images to constrain the segmentation result during network training. In addition, a convolutional triple attention module is used to improve the segmentation network; it captures cross-dimensional interactions and provides a rich feature representation to improve segmentation accuracy. Experiments on the RIM-ONE_r3 and Drishti-GS datasets demonstrate that the algorithm outperforms existing approaches for segmenting optic discs and cups.


Introduction
Glaucoma is the second most common blinding disease after cataracts [1]. In fundus images, the ratio of the vertical height of the optic cup to that of the optic disc can be used for the early diagnosis of glaucoma. Accurately delineating the optic cup and optic disc in fundus images, and accurately computing the CDR from them, has therefore become an active research topic. Deep learning-based techniques for segmenting the optic cup and disc have proven effective and have attracted increasing attention in the field. Sevastopolsky et al. [2], for example, proposed a U-Net-based method for segmenting the optic cup and disc that reduces the number of convolutional kernels and the network complexity. Fu et al. [3] proposed converting fundus images from Cartesian to polar coordinates and used a U-Net with multi-scale inputs and multi-scale outputs to achieve better performance in optic cup and disc segmentation.
Most optic cup and disc segmentation models work best when the test set and training set follow the same distribution, and they tend to perform worse when applied to target domains other than the one they were trained on. This problem is known as domain shift or distributional shift, and domain adaptation is usually used to cope with it. According to the information available for the target task, domain adaptation can be divided into three types, namely, unsupervised, semi-supervised, and supervised domain adaptation. Among them, unsupervised domain adaptation is the one we are most concerned with here. A number of unsupervised domain adaptation algorithms have been proposed to mitigate domain shift in biomedical image segmentation [4][5][6]. For instance, studies that rely on invariance properties common to the source and target domains [4,5] concentrate on partitioning the input space of the network.
To ensure that the output space of the segmentation network is invariant and that the segmentation maps of the source and target domains share the same spatial and geometric structure, [6] employed adversarial learning. Chen et al. [7] proposed an unsupervised framework called IOSUDA for the joint segmentation of the optic cup and optic disc. This framework focuses on separating shared features and style features for feature alignment, achieving input and output space alignment, and reducing performance degradation. Although these methods have achieved remarkable performance, they are apt to produce malformed segmentation regions, as demonstrated in Figure 1, that are very far from the real shapes of the optic cup and optic disc. Here, we propose to overcome this issue using a constrained formulation that provides the segmentation network with domain-invariant prior information based on the shape of the segmentation region. The intuition behind our work is that shape information is a strong and valuable prior for optic cup and disc segmentation, as geometrically the optic cup or disc is very close to a round shape. The effectiveness of shape constraints has been demonstrated very recently in 3D pancreas segmentation [8], motivating us to make use of them for the task at issue here. As seen in Figure 1, our method is capable of providing more realistic segmentation results with the proposed shape constraint.
On the other hand, the U-Net [9], a very effective network introduced by Ronneberger et al. in 2015 for medical image segmentation that still leaves room for improvement, serves as the segmentation sub-network in IOSUDA. In order to locate and extract invariant features from the dataset, Zhang et al. [10] suggested a transferable attention U-Net model that used two discriminators and an attention module. Zhao et al. [11] added an attention gate between the encoder and decoder of U-Net in order to concentrate more on the target region, resulting in an attention U-Net architecture. These works suggest that attention mechanisms are effective in boosting the performance of U-Net, which inspires us to attempt a more advanced attention approach for further improvement. Recently, several studies on computer vision problems have suggested using channel attention, spatial attention, or both to enhance the feature representation ability of convolutional layers and thereby the performance of neural networks. For instance, the Squeeze-and-Excitation (SE) module [12] calculates channel attention and improves performance at a small additional cost. Moreover, the Convolutional Block Attention Module (CBAM) [13] and the Bottleneck Attention Module (BAM) [14] both emphasize the combination of spatial attention and channel attention. The Convolutional Triple Attention Module [15] is a lightweight yet effective attention mechanism that calculates attention weights by capturing cross-dimensional interactions using a three-branch structure. In this paper, a Convolutional Triple Attention Module (CTAM) is used to improve the segmentation performance of the U-Net segmentation sub-network.
The contributions of this paper are as follows:
• We propose a novel unsupervised adaptive framework with shape constraint, called SCUDA, for joint segmentation of the optic cup and optic disc in order to address the problem that existing methods are very likely to produce malformed segmentation regions.
• We exploit a convolutional triple attention module to improve the segmentation network, which is able to capture cross-dimensional interactions and provides rich feature representation in order to boost segmentation accuracy.
• We conduct extensive experiments on the RIM-ONE_r3 dataset and the Drishti-GS dataset to demonstrate the performance of our proposed SCUDA framework. The experimental findings verify that SCUDA outperforms the other tested models.
Figure 1. Comparison of the segmentation results between the state-of-the-art method IOSUDA [16] and our SCUDA method on a fundus image. Panels, from left to right: IOSUDA, SCUDA (ours), and GT. The abbreviation GT refers to ground truth.
The remainder of the paper is structured as follows: we review related work and describe our methodology in Sections 2 and 3, respectively; experimental findings are discussed in Section 4; and the work is concluded in Section 5.

Unsupervised Domain Adaptation
A fairly common type of transfer learning is domain adaptation, which generally refers to taking a model from one domain and applying it to another domain that is only subtly different [17]. Unsupervised domain adaptation in classification is generally built on image and feature alignment [18][19][20][21] between source and target domains. For instance, Long et al. [22] proposed a new network architecture, the Deep Adaptation Network (DAN), which uses an optimal multi-kernel selection method for mean embedding matching to reduce the domain discrepancy. Bousmalis et al. [18] considered shared and private representations of each domain. Unsupervised domain adaptive segmentation has been used in many scenarios, including across various medical images. For example, according to Chen et al. [23], the network can be trained using images from the source domain, with the target domain's image style being the same as that of the source domain. Huo et al. [24] proposed a Synthetic Segmentation Network (SynSeg-Net) in order to stylize images from the source domain into those from the target domain. Song et al. [25] introduced several assumptions for feature space extraction and derived and optimized a loss function for each of them. In addition, to align the feature spaces of the source, target, and output domains with one another, Chen et al. [26] proposed Synergistic Image and Feature Alignment (SIFA).

Optic Cup and Optic Disc Segmentation
Early work in optic cup and disc segmentation focused on hand-crafted features [27][28][29], usually preceded by target region detection [30,31]. Convolutional neural network-based approaches [3,32,33] have significantly improved accuracy and generalizability. A convolutional neural network for segmentation based on lifting trees was designed by Zilly et al. [32]. Fu et al. [33] proposed the Disc-aware Ensemble Network (DENet) for automated glaucoma screening, which integrates data from local optic disc regions with features from global fundus images. A U-Net-based M-Net was proposed by Fu et al. [3] to segment the optic disc and cup, with the segmentation problem converted to a multi-label problem. In addition, a number of semi-supervised methods [34,35] have been proposed to alleviate the shortage of truth labels for the original data. However, these models generalize poorly in the face of domain shift. Recently, unsupervised domain adaptation has attracted considerable attention for cross-dataset segmentation of the optic cup and disc [7,36,37]. In order to address instability in adversarial learning, Liu et al. [36] proposed Collaborative Feature Ensembling Adaptation (CFEA), which makes use of adversarial learning for both the network's output and its intermediate representations. Wang et al. [37] proposed Boundary and Entropy-driven Adversarial Learning (BEAL), which employs a boundary discriminator and an entropy discriminator to address the high entropy and fuzzy boundaries of predictions in the target domain. For joint segmentation of the optic disc and cup, Chen et al. [7] proposed the IOSUDA framework, which includes feature and output space alignment while simultaneously introducing adversarial learning into the training of the segmentation network; shared features of multiple domains are introduced in the input space. In this paper, we propose an unsupervised domain adaptation method with a shape constraint for joint optic disc and cup segmentation.
The comparison of our method with previous approaches in terms of used dataset, learning method, supervision method, and use of U-Net, GAN, attention mechanism, and prior geometric constraint (or not) is summarized in Table 1.

Attention Mechanism
In recent years, many researchers have proposed combining attention mechanisms with deep Convolutional Neural Networks (CNNs) to improve performance on large-scale vision tasks. Double Attention Networks (A²-Nets) [38] use a "double attention block" that aggregates and propagates information-rich global features from the entire spatio-temporal space of the input image or video. The Global Second-order Pooling Network (GSoP-Net) [39] uses second-order pooling to collect important features from the entire input space and then distributes them so that subsequent layers can more easily recognize and propagate them. In addition, Global-Context Networks (GC-Net) [40] proposed a novel NL block combined with an SE block in an effort to more effectively combine contextual representation with channel weighting. Attention mechanisms can be used for image segmentation and classification as well. Criss-Cross networks (CCNet) [41] and SPNet [42] have proposed a new cross-attention module that captures rich contextual information on its criss-cross paths. A pipeline based on two top-down and two bottom-up attention modules has been presented by Xiao et al. [43] for classifying images.

SCUDA Framework
The proposed SCUDA model inherits the IOSUDA pipeline and consists of two parts: the image translation model and the segmentation model. Figure 2 shows an overview of SCUDA. The data used in training are the images from the source domain ($X_s$), the truth labels from the source domain ($Y_s$), and the images from the target domain ($X_t$). The image translation model applies unsupervised transformation between the source and target domains with the goal of learning the shared content features and the corresponding style features. Here, $X_{s-t}$ denotes the transformed image dataset and $X_{s-s}$ denotes the reconstructed image dataset, while the combination of content and style features is represented by the symbol ⊕. In addition, a shape-constrained loss function $L_{shape}$ for segmentation is designed to incorporate prior shape information about the optic cup and optic disc regions, with the purpose of constraining the segmentation region predicted by the network so that it lies within a feasible configuration space. Moreover, a convolutional triple attention module (CTAM) is adopted to improve the encoder-decoder of the segmentation network; it establishes interdependencies between channels and spatial locations to achieve cross-dimensional interactions and boost segmentation performance. The segmentation outputs for the target domain and the source domain are denoted by $Y_t$ and $Y_s$, respectively. The segmentation network is optimized via adversarial learning on the segmentation maps of the source and target domains, which in turn improves the segmentation maps produced for the target domain.

The Proposed Shape-Constrained Loss Function
Shape information is an important and meaningful prior for organ segmentation in medical images. Although fundus images from different datasets may appear quite different due to scanning machines, procedures, stages, etc., they should share the same representation of the anatomical structures, i.e., the contours of the optic cup and optic disc, which are all approximately circular. This prior shape information can be used to constrain the segmentation results. Specifically, the segmentation result for a fundus image corresponds to two contour boundaries, one for the optic cup and one for the optic disc, as shown in Figure 3. In the GT panel of Figure 3, the green contour line indicates the optic disc segmentation and the blue contour line indicates the optic cup segmentation. We denote the sets of coordinates of the contour boundaries of the optic cup and optic disc as $I_{cup}$ and $I_{disc}$, respectively. The center of mass of the optic disc is denoted by $(C_X, C_Y)$, where $C_X$ and $C_Y$ are its x-coordinate and y-coordinate, respectively. Similarly, the center of mass of the optic cup is denoted by $(D_X, D_Y)$, where $D_X$ and $D_Y$ are its x-coordinate and y-coordinate, respectively. An illustration of the computed centroids is shown in Figure 4, marked by dots.
(Figure 3 panels: original image; GT; optic disc region boundary; optic cup region boundary.)
If the contour boundary of a region is a circle, the distances from all points on the contour to its centroid are equal, and consequently are the same as their mean. In view of this, we use the mean deviation of the distances from their mean as a measure of how far a contour deviates from a circle, normalized by the mean distance in order to eliminate scale variations. The proposed shape-constrained loss function for segmenting the optic cup, $L_{cup}$, is built from the following quantities: $E_i^{cup} = \lVert C - m_i \rVert_2$, $m_i \in I_{cup}$, is the distance of the $i$th point ($i \in [1, k]$) on the contour of the optic cup region to its centroid, $k$ denotes the number of points on the discrete contour, and $m_{cup}$ represents the mean of these distances. By the same token, we can define the shape-constrained loss function for segmenting the optic disc, denoted by $L_{disc}$. Together, $L_{cup}$ and $L_{disc}$ form the shape-constrained loss function $L_{shape}$ for segmenting fundus images.
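The circularity measure described above can be illustrated with a short sketch. It assumes that the centroid is the plain mean of the contour coordinates and that "mean deviation" means the mean absolute deviation, i.e., a loss of the form $\frac{1}{k}\sum_i |E_i - m| / m$; the exact form used in the paper may differ. The function names `centroid`, `shape_loss`, and `ellipse` are hypothetical, and the sketch works on already extracted contour points rather than on differentiable network outputs.

```python
import math


def centroid(contour):
    """Mean of the (x, y) contour coordinates (assumed centroid definition)."""
    k = len(contour)
    cx = sum(x for x, _ in contour) / k
    cy = sum(y for _, y in contour) / k
    return cx, cy


def shape_loss(contour):
    """Normalized mean absolute deviation of centroid distances (assumed form).

    Returns 0 for a perfect circle and grows as the contour departs from one.
    """
    cx, cy = centroid(contour)
    dists = [math.hypot(x - cx, y - cy) for x, y in contour]
    mean_dist = sum(dists) / len(dists)
    if mean_dist == 0.0:
        # Degenerate contour: all points coincide, so there is nothing to penalize.
        return 0.0
    mean_dev = sum(abs(d - mean_dist) for d in dists) / len(dists)
    return mean_dev / mean_dist


def ellipse(a, b, n=360):
    """Sample n points on an axis-aligned ellipse with semi-axes a and b."""
    return [(a * math.cos(2 * math.pi * i / n), b * math.sin(2 * math.pi * i / n))
            for i in range(n)]


if __name__ == "__main__":
    print(shape_loss(ellipse(5, 5)))    # circle: close to zero
    print(shape_loss(ellipse(10, 5)))   # elongated contour: clearly positive
```

Dividing by the mean distance makes the value independent of the contour's size, so the same elongated shape at two different scales receives the same penalty.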

Total Loss Function
The loss function of the SCUDA framework includes the loss of the image translation module and the loss of the image segmentation module in addition to the shape-constrained loss. For the image translation module, let $E_C$ denote the content encoder, $E_S$ the style encoder, $C$ the shared content feature space, $S_S$ and $S_T$ the style feature spaces of the source domain and the target domain, respectively, $G$ the shared decoder, and $L_1$ the L1 distance. For a source domain image $x_s \in X_S$ and features $c_s, c_t \in C$, $s_s \in S_S$, $s_t \in S_T$, three reconstruction losses are defined: the source domain image loss $L_{rec}^{x_s}$, the source domain image content feature loss $L_{rec}^{c_s}$, and the source domain image style feature loss $L_{rec}^{s_s}$. In these definitions, $\mathbb{E}_z$ indicates computing the expectation of a function of $z$. The target domain image loss $L_{rec}^{x_t}$, its content feature loss $L_{rec}^{c_t}$, and its style feature loss $L_{rec}^{s_t}$ are defined analogously to the losses of the source domain image. For image translation between the domains, the discriminator $D_1$ aims to distinguish the target domain image $x_t$ from the transformed image $x_{s-t}$, while the discriminator $D_2$ aims to distinguish the source domain image $x_s$ from the transformed image $x_{t-s}$. The former loss function is denoted by $L_{dis}^{x_t, x_{s-t}}$, and the loss function $L_{dis}^{x_s, x_{t-s}}$ is defined similarly. The total loss of the image translation model combines these terms, with $\mu_1, \mu_2, \mu_3, \mu_4$ denoting the weights of the components.
In the image segmentation module, the segmentation of the optic cup and optic disc is converted to a multi-class classification task with the segmentation label map $y_s \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ are the image height and width and $C$ is the number of categories. The segmentation network takes $c_s$ as input to obtain a predicted segmentation map $y_s$; similarly, $c_t$ is taken as input to obtain a predicted segmentation map $y_t$.
In addition, the role of the discriminator $D$ is to judge $y_s$ as real and $y_t$ as fake. The output size of the discriminator is $m \times n$, and its loss function is defined by
$$L_{dis}^{y_s, y_t} = \sum_{m,n} \log\big((D(y_s))^{(m,n)}\big) + \log\big(1 - (D(y_t))^{(m,n)}\big).$$
A segmentation loss is also defined for $y_s$.
To make $y_t$ resemble $y_s$, the segmentation network is trained to confuse the discriminator into judging the patches of $y_t$ as real. The adversarial loss is defined by
$$L_{adv}^{y_t} = \sum_{m,n} \log\big((D(y_t))^{(m,n)}\big). \quad (13)$$
The total loss of the image segmentation network combines these terms, with $\delta_1$ and $\delta_2$ denoting the weights of the components. Because the shape constraint is applied in both the source and target domains, there are four shape-constrained loss terms during gradient backpropagation, which together form the total shape-constrained loss function $L_{shape}^{total}$. Taken together, the total loss of the proposed model combines the loss of the image translation model, the loss of the image segmentation network, and $L_{shape}^{total}$.
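As a small numerical illustration of the two patch-wise sums given in the text, the sketch below evaluates the discriminator loss and the adversarial loss on plain nested lists of discriminator outputs. The function names, the `eps` clamp, and the toy inputs are assumptions made for this example; the code only evaluates the sums as written and does not decide which network maximizes or minimizes them during training.

```python
import math


def _safe_log(p, eps=1e-12):
    """Logarithm with clamping so that probabilities of exactly 0 stay finite."""
    return math.log(max(p, eps))


def discriminator_loss(d_ys, d_yt):
    """Sum over patches of log D(y_s) + log(1 - D(y_t)).

    d_ys, d_yt: m x n nested lists of discriminator outputs in [0, 1]
    for the source and target segmentation maps, respectively.
    """
    total = 0.0
    for row_s, row_t in zip(d_ys, d_yt):
        for p_s, p_t in zip(row_s, row_t):
            total += _safe_log(p_s) + _safe_log(1.0 - p_t)
    return total


def adversarial_loss(d_yt):
    """Sum over patches of log D(y_t), the term used to confuse the discriminator."""
    return sum(_safe_log(p_t) for row in d_yt for p_t in row)


if __name__ == "__main__":
    # Hypothetical 1 x 2 discriminator outputs.
    d_source = [[0.9, 0.8]]
    d_target = [[0.2, 0.1]]
    print(discriminator_loss(d_source, d_target))
    print(adversarial_loss(d_target))
```

Both sums approach zero from below as the discriminator outputs approach their intended values, which is why a confident, correct discriminator yields a discriminator loss near zero in this form.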

Convolutional Triple Attention Module (CTAM)
The shared content features obtained by the image translation model are then fed to the segmentation network. Concretely, the segmentation network uses an adjusted U-Net; because the shared content features are already downsampled from the original image, the first two downsampling layers of the network are removed to satisfy the dimensionality requirement. The Convolutional Triple Attention Module (CTAM) [15], a compact yet powerful attention module, is deployed at the interface between the innermost encoder and decoder of U-Net to further enhance the segmentation network. CTAM captures cross-dimensional interactions of an input tensor by establishing inter-dimensional correlations through a rotation operation and subsequent residual transformations. By computing attention weights, it provides rich feature representations and produces a refined tensor with the same shape as the input. The detailed structure of the CTAM is shown in Figure 5. CTAM contains three parallel branches. Two of them each capture the interaction between the channel dimension C and one spatial dimension, i.e., H or W, and the third builds spatial attention. The outputs of all three branches are combined by simple averaging.
More specifically, CTAM accepts an input tensor $x \in \mathbb{R}^{C \times H \times W}$, where C denotes the number of channels and H and W denote the height and width of the spatial feature map, respectively, which is passed to each of the three branches. In the first branch, the height dimension and the channel dimension interact. To this end, $x$ is rotated 90° counterclockwise along the H axis, giving $x_1$ with shape (W × H × C), which Z-pool then reduces to shape (2 × H × C). The result goes through a convolution layer followed by a batch normalization layer, and the attention weights are obtained by passing the output through a sigmoid activation layer. The weights are applied to $x_1$, and the result is rotated 90° clockwise along the H axis to restore the original shape of $x$. The final tensor of the first branch is denoted $x_1^*$. Likewise, in the second branch, a 90° counterclockwise rotation along the W axis is applied to $x$, following the same principle as in the first branch, to obtain the refined $x_2^*$. In the last branch, Z-pool reduces the channels of the input tensor $x$ to two, producing $x_3$ with shape (2 × H × W), which is then processed by a convolution layer followed by a batch normalization layer. Passing the output through the sigmoid activation layer generates an attention weight with shape (1 × H × W); the final tensor of this branch is denoted $x_3^*$. The refined tensor of shape (C × H × W) is obtained by simply averaging the tensors generated by the three branches.
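To sum up, for an input tensor $x \in \mathbb{R}^{C \times H \times W}$, the following equation illustrates how the refined tensor $y$ is obtained from the three branches:
$$y = \frac{1}{3}\left(\overline{x_1 \omega_1} + \overline{x_2 \omega_2} + x\,\omega_3\right) = \frac{1}{3}\left(x_1^* + x_2^* + x_3^*\right), \quad (17)$$
where $\omega_1$, $\omega_2$, and $\omega_3$ are the three cross-dimensional attention weights calculated in the triple attention, and the overline denotes the 90° clockwise rotation that restores the original input shape.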
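Two of the building blocks named above, Z-pool and the final averaging of the three branch outputs, are simple enough to sketch on nested lists. The sketch assumes that Z-pool concatenates the maximum and the mean taken along the first dimension, which is how the reduction to two channels is usually described for this kind of module; the convolution, batch normalization, sigmoid, and rotation steps are omitted. The function names are hypothetical.

```python
def z_pool(x):
    """Reduce the first dimension of a [C][H][W] nested list to size 2.

    Assumed definition: channel 0 holds the maximum and channel 1 holds the
    mean over the first dimension, giving an output of shape [2][H][W].
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    max_map = [[max(x[k][i][j] for k in range(c)) for j in range(w)]
               for i in range(h)]
    avg_map = [[sum(x[k][i][j] for k in range(c)) / c for j in range(w)]
               for i in range(h)]
    return [max_map, avg_map]


def average_branches(b1, b2, b3):
    """Element-wise mean of three [C][H][W] branch outputs of identical shape."""
    return [[[(b1[k][i][j] + b2[k][i][j] + b3[k][i][j]) / 3
              for j in range(len(b1[0][0]))]
             for i in range(len(b1[0]))]
            for k in range(len(b1))]


if __name__ == "__main__":
    # Toy tensor with C=2, H=1, W=2.
    x = [[[1, 2]], [[3, 0]]]
    print(z_pool(x))  # [[[3, 2]], [[2.0, 1.0]]]
```

Because Z-pool always outputs two channels regardless of C, the convolution that follows it needs only a handful of weights, which is consistent with the module being described as lightweight.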
It is worth noting that the incorporation of CTAM into U-Net is based on the following considerations. Despite being widely used, U-Net can be further improved for various segmentation tasks, especially through attention mechanisms, with the motivation of ensuring that the network devotes more focus to the important parts of the data. Remarkably, the parameters related to attention mechanisms can be learned without introducing additional losses. Many works have used attention mechanisms to improve U-Net for medical segmentation [44][45][46][47], including segmentation of the optic disc and cup [10,11,48]. However, the attention methods used in these works require quite a number of learnable parameters, which can easily lead to overfitting in view of the limited training data in many medical segmentation tasks. Fortunately, a cheap and very effective attention method, namely, CTAM, was proposed in [15] with the aim of capturing cross-dimensional interactions while computing attention weights to provide rich feature representations. It has demonstrated the ability to provide similar or better performance than the alternatives. In light of these advantages, in this paper we apply this triplet attention method to boost the performance of U-Net. Because the triplet attention module receives an input tensor and outputs a refined tensor of the same shape, it can be applied to any layer to enhance the feature representation there. To avoid adding too many parameters, we only apply it to the deepest layer of U-Net, as this is the layer with the most abstract representation, which we believe should have the greatest effect on the final result. Trivially, the CTAM becomes an identity map when, say, the convolutional layers in the CTAM have zero kernels and the cross-dimensional attention weights sum to 1. Hopefully, a CTAM can be learned that performs better than this trivial case.

Implementation Details
The network model for this experiment was implemented in the PyTorch framework, and training and testing were performed on an RTX 3090 with 24 GB of memory. A pre-trained model [3] was used to locate the optic cup and optic disc region of the fundus images in the experimental dataset. Training images were then obtained by cropping and scaling, and the training images were normalized, randomly inverted, and cropped for input. In addition, random seeds were fixed in the experiment. The size of the input training image was 256 × 256 pixels. The whole model framework was optimized using the Adam method with a batch size of 2 for 400 epochs, and the initial learning rate was set to $10^{-4}$.
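For reproducibility, the stated hyperparameters can be gathered into a small configuration object. This is only an illustrative sketch: the class name `TrainConfig`, its field names, and the helper `iterations_per_epoch` are hypothetical, and the seed value is a placeholder because the text states that seeds were fixed without giving the value. Only the image size, batch size, epoch count, optimizer, and learning rate come from the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings reported in the text (field names are hypothetical)."""
    image_height: int = 256        # input size of 256 x 256 pixels
    image_width: int = 256
    batch_size: int = 2
    epochs: int = 400
    optimizer: str = "adam"
    learning_rate: float = 1e-4    # initial learning rate
    seed: int = 0                  # placeholder: the actual seed is not stated


def iterations_per_epoch(num_images, cfg):
    """Number of mini-batches needed to see every training image once."""
    return math.ceil(num_images / cfg.batch_size)


if __name__ == "__main__":
    cfg = TrainConfig()
    # 100 is an arbitrary dataset size chosen for illustration.
    print(cfg, iterations_per_epoch(100, cfg))
```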

Datasets
The RIM-ONE_r3 [49] dataset, the Drishti-GS [28] dataset, and the REFUGE [50] dataset were the three publicly available fundus imaging datasets used in this experiment. They have different appearances, as shown in Figure 6. Following [7], the datasets from the source and target domains were split into a training set and a test set for this experiment. The RIM-ONE_r3 and Drishti-GS datasets were employed as the target domains, while the REFUGE training set served as the source domain. Table 2 shows the statistical distribution of the data.
Figure 6. Example fundus images from different datasets. From left to right: REFUGE [50], RIM-ONE_r3 [49], and Drishti-GS [28].

Evaluation Metrics
This experiment uses the IoU coefficient of the optic cup and optic disc along with their Dice coefficient as evaluation indicators. TP (true positives), FP (false positives), TN (true negatives), and FN (false negatives) are the numbers of pixels in the segmentation that match the ground truth (TP and TN) or do not (FP and FN). Using the standard definitions, the two metrics are
$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN}.$$
The higher the Dice and IoU values, i.e., the closer they are to 1, the better the segmentation performance of the model. IoU_OD and Dice_OD denote the IoU and Dice values of the optic disc, respectively, while IoU_OC and Dice_OC denote the IoU and Dice values of the optic cup, respectively.
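The two metrics can be computed directly from pixel counts. The sketch below uses the standard definitions Dice = 2TP / (2TP + FP + FN) and IoU = TP / (TP + FP + FN) on binary masks stored as nested lists. The function names are hypothetical, and returning 1.0 when both masks are empty is a convention chosen for this example rather than something stated in the text.

```python
def confusion_counts(pred, gt):
    """Count TP, FP, FN, and TN pixels for two binary masks of equal shape."""
    tp = fp = fn = tn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def dice(pred, gt):
    """Dice coefficient: 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(pred, gt)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # both masks empty: treated as a match


def iou(pred, gt):
    """Intersection over union: TP / (TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(pred, gt)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # both masks empty: treated as a match


if __name__ == "__main__":
    prediction = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [0, 0]]
    print(dice(prediction, ground_truth), iou(prediction, ground_truth))
```

Neither metric uses TN, so large background regions around the optic disc do not inflate the scores.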

Quantitative and Qualitative Analysis
We compare our method with five state-of-the-art methods for segmenting the optic cup and optic disc on two datasets, namely, RIM-ONE_r3 [49] and Drishti-GS [28], to show the efficacy of the SCUDA framework proposed in this study. The methods for comparison fall into two types. The first type consists of models without domain adaptation, namely CycleGAN [51] and Pix2Pix [16]. CycleGAN is an image translation model trained on unpaired data, which can transform fundus images into segmentation images to achieve target segmentation. Numerous studies have utilized Pix2Pix, a conditional generative adversarial network (cGAN), for segmentation tasks [52,53]. The second type consists of unsupervised domain adaptation models, namely SynSeg-Net [24], SIFA [26], and IOSUDA [7]. SynSeg-Net provides image alignment in the input space, while SIFA combines feature alignment and output space alignment. In our evaluation, CycleGAN and Pix2Pix are trained using the source domain dataset. On the other hand, SynSeg-Net, SIFA, IOSUDA, and the SCUDA model proposed in this paper are trained using data from the source domain and the unlabeled training portion of the target domain, while the test data come from the target domain. Table 3 reports the experimental results. As can be seen, the RIM-ONE_r3 dataset is more difficult than the Drishti-GS dataset, as all the metrics of the tested methods are significantly lower on the former, reflecting the more severe domain shift of the RIM-ONE_r3 dataset compared to the Drishti-GS dataset. Remarkably, our SCUDA method achieves the best performance in terms of all metrics on both datasets. For example, on the RIM-ONE_r3 dataset, our method outperforms the second-best method, IOSUDA, by 1.83%, 2.02%, 1.66%, and 1.73% in IoU_OD, IoU_OC, Dice_OD, and Dice_OC, respectively. On the Drishti-GS dataset, our method likewise outperforms the second-best method, again IOSUDA, by 1.26%, 1.70%, 1.41%, and 1.49%, respectively.
Results such as those above demonstrate how well our proposed SCUDA model works.
On eight test samples from RIM-ONE_r3 and Drishti-GS, Figures 7 and 8 compare our method qualitatively with two state-of-the-art methods, namely the baseline IOSUDA method and SIFA. Concretely, the first and second columns of Figures 7 and 8 show the fundus images and the corresponding ground truths, and the other columns show the fundus images with the boundaries of the optic cup and optic disc marked by the different methods. The green contour lines in the figures indicate the optic disc segmentation results and the blue ones indicate the optic cup segmentation results. It can be observed that our SCUDA approach produces better segmentation results with relatively smooth and accurate contours in all these cases, regardless of whether the fundus images contain clear or blurred contours, while the other methods produce malformed segmentations in most of these cases. We ascribe this to the effectiveness of the proposed shape constraint, which embeds domain-invariant prior knowledge concerning the circular shape of the optic cup and optic disc into our model.

Ablation Study on the Impact of the Weight of the Shape Constraint
To investigate how the weight of the shape constraint loss affects segmentation performance, we evaluated the proposed SCUDA on the RIM-ONE_r3 dataset under different weights. The weights ranged from 0.2 to 2.0 with a step size of 0.2. The four metrics of SCUDA for the different weights are shown in Table 4. When the weight of the loss function is 1.2, SCUDA achieves the best IoU_OD and Dice_OD, which are 84.84% and 94.65%, respectively; when the weight is 1.0, SCUDA achieves the best IoU_OC and Dice_OC, which are 61.19% and 73.65%, respectively. Overall, the best performance is obtained when the weight is 1.0, which is the default setting of the weight in our proposed SCUDA. The trend of the four metrics as the weight changes from 0.2 to 1.8 is plotted in Figure 9 to provide a more intuitive grasp of the influence of this weight; the average of the four metrics is plotted as the gray dotted line. As can be seen, when the weight changes from 0.2 to 1.0, IoU_OC and Dice_OC show an overall increasing trend, except for a drop at 0.8. Although no clear increasing trend is visible for IoU_OD and Dice_OD, apparent drops can be observed at 1.4 for both metrics. On average, when the weight increases from 1.0 to 1.8, an overall decreasing trend is observed, except for a rise at 1.6. These results suggest that the segmentation performance can be improved if the shape constraint is imposed moderately. To illustrate the impact of the weight of the shape constraint more intuitively, we show five examples of segmentation with different weights in Figure 10. The segmentation results become visually better as the weight goes from 0.4 to 1.0.
This justifies the effectiveness of the shape constraint for optic cup and optic disc segmentation. It is also consistent with the fact that the shape-constrained loss function is based on an approximately (though not strictly) correct assumption, so that a constraint that is too strong imposes false prior information on the trained model.
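The quantities involved in this weight study can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `iou`, `dice`, and `weighted_objective` are hypothetical, the metrics follow the standard definitions for binary masks, and `base_loss` is an assumed stand-in for all remaining terms of the training objective, which are not detailed in this section.

```python
import numpy as np


def iou(pred, gt):
    """Intersection over union of two binary masks (standard definition)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return float(np.logical_and(pred, gt).sum() / union)


def dice(pred, gt):
    """Dice coefficient of two binary masks (standard definition)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    total = pred.sum() + gt.sum()
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return float(2.0 * np.logical_and(pred, gt).sum() / total)


def weighted_objective(base_loss, shape_loss, weight):
    """Assumed form of the objective: the shape term is scaled by `weight`.

    `base_loss` is a hypothetical placeholder for every other loss term.
    """
    return base_loss + weight * shape_loss


# Weight grid described in the text: 0.2 to 2.0 with a step size of 0.2.
WEIGHTS = [round(0.2 * k, 1) for k in range(1, 11)]


def mean_of_four(iou_od, dice_od, iou_oc, dice_oc):
    """Average of the four metrics, as plotted by the gray dotted line."""
    return (iou_od + dice_od + iou_oc + dice_oc) / 4.0
```

Under these assumptions, one would train a model for each value in `WEIGHTS`, compute the four metrics on the test set, and compare either the individual metrics or `mean_of_four` across weights.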

Ablation Study on the Effect of the Proposed Components
To demonstrate the effectiveness of the two components proposed in this paper, that is, the shape-constrained loss and the CTAM module, an ablation study was carried out. In this experiment, IOSUDA was the baseline model. Depending on whether each component was incorporated, there were four candidate models: (1) IOSUDA, (2) IOSUDA+L_total_shape, (3) IOSUDA+CTAM, and (4) SCUDA. IOSUDA+L_total_shape denotes the addition of the shape-constrained loss function L_total_shape to the IOSUDA model, IOSUDA+CTAM indicates that the convolutional triple attention module (CTAM) was added to the IOSUDA model, and SCUDA indicates that both the shape-constrained loss function L_total_shape and the CTAM were added to the IOSUDA model. These four models were evaluated on the RIM-ONE_r3 and Drishti-GS datasets. Table 5 reports the experimental results, which are also plotted as a bar chart in Figure 11 for easier comparison. Compared with IOSUDA, both IOSUDA+L_total_shape and IOSUDA+CTAM improve the IoU and Dice values of the optic cup and optic disc on the test datasets, which supports the effectiveness of the shape-constrained loss function and the CTAM module proposed in this paper. Specifically, taking Dice_OC as an example, IOSUDA+L_total_shape and IOSUDA+CTAM show improvements of 1.33% and 0.34%, respectively, over the baseline IOSUDA model on the RIM-ONE_r3 dataset. As for IoU_OD on the Drishti-GS dataset, IOSUDA+L_total_shape and IOSUDA+CTAM show improvements of 0.97% and 0.52%, respectively, over IOSUDA. Overall, IOSUDA+L_total_shape performs better than IOSUDA+CTAM, although the best outcome on both datasets is reached only when the two components are combined, that is, in SCUDA. The effectiveness of the proposed components is therefore justified.
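The structure of this ablation, two components that are each switched on or off, can be written down compactly. The following Python sketch only enumerates the four candidate models and computes per-model gains over the baseline. The names `variant_name`, `VARIANTS`, and `gain_over_baseline` are hypothetical, and the sketch contains no training code and none of the values from Table 5.

```python
from itertools import product


def variant_name(use_shape_loss, use_ctam):
    """Map the two on/off switches to the model names used in the text."""
    if use_shape_loss and use_ctam:
        return "SCUDA"
    if use_shape_loss:
        return "IOSUDA+L_total_shape"
    if use_ctam:
        return "IOSUDA+CTAM"
    return "IOSUDA"


# All four combinations of the two components.
VARIANTS = {
    variant_name(shape, ctam): {"shape_loss": shape, "ctam": ctam}
    for shape, ctam in product([False, True], repeat=2)
}


def gain_over_baseline(scores, baseline="IOSUDA"):
    """Difference of each model's score from the baseline, in percentage points.

    `scores` maps a model name to one metric value (for example Dice_OC in %).
    """
    return {
        name: round(value - scores[baseline], 2)
        for name, value in scores.items()
        if name != baseline
    }
```

Given one metric column from the results table as a dictionary keyed by model name, `gain_over_baseline` returns the kind of improvement figures quoted above for the individual components.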

Conclusions
In this paper, we propose an unsupervised domain adaptation framework with a shape constraint for joint optic disc and cup segmentation, which we call SCUDA. We propose a novel shape-constrained loss function that utilizes domain-invariant prior knowledge about the segmentation region of the optic cup and optic disc in fundus images to constrain the segmentation results during network training. Moreover, we design a convolutional triple attention module for the segmentation network that captures cross-dimensional interactions and provides a rich feature representation in order to improve the segmentation performance of the network. Extensive experiments show that the proposed SCUDA framework outperforms state-of-the-art methods for segmentation of the optic cup and optic disc on both the RIM-ONE_r3 and Drishti-GS datasets.
Compared with existing methods, we make the first attempt to use prior shape constraints to develop models for joint optic disc and cup segmentation, and we use a cheaper yet more effective attention method to boost the performance of U-Net. It is worth noting that, in this work, the shape-constrained loss function is based on an approximate assumption, not a strictly correct one. Our future work will include investigating more realistic shape assumptions for constructing training constraints, along with more effective and efficient attention mechanisms for improving U-Net and novel unsupervised domain adaptation frameworks for transfer learning.