Semi-Supervised Medical Image Segmentation with Co-Distribution Alignment

Medical image segmentation has made significant progress when a large amount of labeled data are available. However, annotating medical image segmentation datasets is expensive due to the requirement of professional skills. Additionally, classes are often unevenly distributed in medical images, which severely affects the classification performance on minority classes. To address these problems, this paper proposes Co-Distribution Alignment (Co-DA) for semi-supervised medical image segmentation. Specifically, Co-DA aligns marginal predictions on unlabeled data to marginal predictions on labeled data in a class-wise manner with two differently initialized models before using the pseudo-labels generated by one model to supervise the other. Besides, we design an over-expectation cross-entropy loss for filtering the unlabeled pixels to reduce noise in their pseudo-labels. Quantitative and qualitative experiments on three public datasets demonstrate that the proposed approach outperforms existing state-of-the-art semi-supervised medical image segmentation methods on both the 2D CaDIS dataset and the 3D LGE-MRI and ACDC datasets, achieving an mIoU of 0.8515 with only 24% labeled data on CaDIS, and a Dice score of 0.8824 and 0.8773 with only 20% data on LGE-MRI and ACDC, respectively.


Introduction
Currently, there are various new technologies and devices that assist in clinical diagnostic work [12,24,8], among which medical image segmentation plays an important role in clinical auxiliary diagnosis [44].Recently, researchers have made great efforts in medical image segmentation [31,53,11] and achieved excellent performance with a large amount of labeled data.However, the annotation of medical data are typically dependent on medical professionals, and annotating large datasets is time-consuming.
To address this problem, semi-supervised (SS) medical image segmentation leverages a large amount of unlabeled data in conjunction with a small amount of labeled data to improve model performance.Particularly, unlabeled data are relatively affordable, as the laborious annotation process can be avoided.Recently, consistency regularization methods [49,20,4,51,15] have received great attention in SS medical image segmentation.The primary difference of various consistency regularization methods lies in their intended objectives.For example, the majority of perturbation consistency methods [4] tend to maintain a consistent unlabeled prediction with various augmentations.Uncertaintyaware methods [49,15] force consistent predictions in reliable regions of two corresponding models.Multi-task based methods [20,51] design a multi-task framework to guarantee an invariant relation of unlabeled data among different tasks.Despite the fact that these consistency regularization methods have obtained encouraging results, most of the previous methods neglect two essential problems: class imbalance and the mismatch of class distributions between labeled and unlabeled data.For the first problem, there have been some effective and efficient solutions [18,16] to deal with class imbalance in the fully supervised scenario.However, these methods are unsuitable for the semi-supervised case because long-tail samples and noisy samples are usually difficult to identify in unlabeled data.For the second problem, ReMixMatch [2] proposed a coefficient transform to align labeled and unlabeled class distributions for SS classification.One of their key limitations, however, is that it only considers the empirical ground-truth class distribution, which could be highly imbalanced or even biased when the labeled data are scarce.In addition, estimating the labeled distributions may lead to an unaffordable computational cost for dense prediction tasks such as image segmentation.
In this paper, we focus on maintaining consistent distributions of labeled and unlabeled data under a co-training framework, unlike existing consistency regularization methods.However, it is non-trivial to achieve class distribution consistency.In particular, the empirical class distribution could be highly imbalanced or even biased when the data are sparsely labeled.Figure 1a shows the vanilla Distribution Alignment (DA) [2] in which an overall output class distribution on unlabeled data are maintained to align model outputs to the empirical ground-truth class distribution.More specifically, DA maintains a running average of predictions on unlabeled data.When the model outputs a prediction for an unlabeled sample, the distribution alignment scales the prediction by the empirical class distribution on labeled data over the average predictions on unlabeled data, so as to obtain an output that is aligned to the ground-truth class distribution.On the other hand, Figure 1b shows the cross-pseudo supervision [7] based on co-training [3] that uses two parallel networks with identical architecture but different initializations, and then uses the output of one network to supervise the other one.Inspired by ReMixMatch and cross-pseudo supervision, we propose a novel Co-Distribution Alignment (Co-DA) method to overcome class imbalance and the mismatch of class distributions by integrating the above methods into a unified learning framework.Specifically, Co-DA transforms the model output according to the ratio of the class-specific marginal distribution on labeled data over the average model predictions on unlabeled data for that same class and supervise the model from the other view as pseudo-labels.Different from ReMixMatch, Co-DA uses an exponential moving average (EMA) to simplify the estimation of the labeled distribution.More importantly, instead of relying solely on a single empirical ground-truth class distribution, we seek to fully exploit the model prediction for all classes, aiming to minimize the class-dependent distribution discrepancy between the model outputs on the labeled and the unlabeled data.Therefore, we preserve independent distributions for each class instead of an overall distribution as shown in Figure 1a,c.On the other hand, in contrast to typical co-training, Co-DA tends to keep the consistency between labeled and unlabeled distributions rather than the cross-consistency of predictions as shown in Figure 1b,c.Therefore, Co-DA is more computationally efficient for dense prediction tasks such as image segmentation and is less likely to be affected by errors at individual pixels within the co-training framework.To further reduce the impact from inaccuracies in the model prediction on unlabeled data, we design an over-expectation cross-entropy loss to filter out noises in pseudo-labels.Our main contributions can be summarized as follows:

M1
• To our knowledge, we are the first to solve the semi-supervised medical image segmentation problem with distribution alignment.In particular, we propose class-wise distribution alignment that utilizes the class-dependent output distribution instead of the overall empirical ground-truth class distribution, which could be highly imbalanced and biased when the labeled data are scarce.• Our co-distribution alignment framework is more computationally efficient for dense prediction tasks involving a large number of pixels as compared to typical co-training methods such as CPS [7].More importantly, distribution alignment has better regularization, as the proposed method provides superior performance.• To further reduce the impact from inaccurate predictions on unlabeled data, we propose a simple, yet effective over-expectation cross-entropy loss to filter out noises in pseudo-labels.• Experimental evaluation results on three publicly available medical imaging datasets demonstrate the superior performance of our approach compared to the state-of-the-art methods.Moreover, ablation studies also verify the efficacy of the various components in Co-DA.• Our method does not depend on a particular deep network architecture.Therefore, it can be used in conjunction with different models for medical image segmentation as a plug-and-play module to address the challenges of learning from imbalanced data and the distribution mismatch between labeled and unlabeled data.
The rest of the paper is organized as follows.Section 2 reviews recent literature in the areas of deep semi-supervised learning, semi-supervised medical image segmentation, co-training and distribution alignment methods.Section 3 describes the proposed method in detail, followed by experimental evaluation in Section 4 and closing remarks in Section 5.

Related Work
In this section, we review related literature in semi-supervised medical image segmentation.We first review recent work in deep semi-supervised learning, and then more specifically in semi-supervised medical image segmentation, as well as progress in co-training and distribution alignment methods that are closely related to our approach.

Deep Semi-Supervised Learning
Recently, semi-supervised learning (SSL) has made remarkable progress in various machine learning tasks.SSL methods can be broadly categorized into pseudo-labeling methods, consistency regularization methods, entropy minimization methods and hybrid methods.Specifically, pseudo-labeling methods [14,13,17,50] aim at obtaining the pseudo-labels of unlabeled data by self-training.Consistency regularization methods [42,20,33,47] force similar predictions under different perturbations of unlabeled data to expand the decision regions.Entropy minimization methods [10,22] tend to make the decision boundary follow low density regions with the help of unlabeled data.Hybrid methods [36,2,40] simultaneously combine some advantages of the above SSL methods.Our method belongs to both pseudo-labeling and consistency regularization methods, yet it addresses a critical problem that is largely ignored in existing methods.To be specific, most existing methods neglect class imbalance and the mismatch of class distributions between labeled data and unlabeled data, which are common in medical images and severely affect the performance of SSL methods.

Semi-Supervised Medical Image Segmentation
The distinctive appearance and class distribution characteristics of medical images pose unique challenges in applying SSL methods to them.In particular, medical image segmentation usually involves localizing objects with extreme shape and scale variations, and the class distribution could be highly skewed.Similar to generic SSL methods, common semi-supervised medical image segmentation methods include GAN-based, consistency regularization and pseudo-labeling methods.Specifically, GAN-based methods [38,27] attempt to use adversarial training to fool the discriminator with unlabeled data.For example, DCT-Seg [28] utilizes two models that are co-trained to generate pseudo-labels for each other.In addition, SS-Net [52] proposes a collaborative learning method to jointly improve the performance of disease grading and lesion segmentation with an attention mechanism.On the other hand, different from consistency regularization methods in generic SSL, for medical image segmentation, people usually design a strategy to keep consistency in local regions, e.g., regions of low uncertainty [15], regions of random category [46], etc.Finally, pseudo-labeling methods leverage an auxiliary model to generate pseudo-labels for unlabeled data.Unlike existing methods, our Co-DA focuses on exploring the discrepancy between labeled and unlabeled data distributions, which is of vital importance in SS medical image segmentation due to data scarcity and imbalance.

Co-Training
The original co-training algorithm [3] assumes that there are two naturally segmented views of the same instance, which are redundant and independent under certain conditions.Concretely, data from any of them are sufficient to train a strong learner, and the views are independent of each other.More specifically, the main steps of the co-training algorithms are divided into view acquisition, learner differentiation and label confidence estimation [25].Recent progresses in co-training [30,28,6] primarily focus on maintaining the diversity across models with deep networks.In particular, cross-pseudo supervision [7] proposes to impose consistency on two segmentation networks perturbed with different initialization for the same input image.In contrast, Co-DA uses co-training to align the distributions of labeled and unlabeled predictions.

Distribution Alignment
Distribution alignment is widely used in domain adaptation.These methods work by aligning marginal distributions [37,9] or joint distributions [43,34,32] of different domains.In semi-supervised learning, domain alignment has also been considered to close the gap between predictions on labeled and unlabeled data.Compared to our proposed approach, the most relevant existing method is ReMixMatch [2], which uses a coefficient transformation to align the marginal distributions of labeled and unlabeled data.However, class distribution in medical images can be highly imbalanced, while ReMixMatch treats samples from all classes as a whole and neglects the variations in marginal distribution of each class.Different from ReMixMatch, in this work, we align marginal distributions in a class-wise manner that is more general and works better for imbalanced datasets with minority classes.

Our Approach
In this section, we describe the proposed approach in detail.We begin by revisiting the cross-pseudo supervision framework that our work is based on.Afterwards, we describe the three main components of Co-Distribution Alignment, i.e., marginal distribution estimation, class-wise distribution estimation and distribution transformation.In addition, we introduce an over-expectation cross-entropy loss that is used to further improve model performance by filtering out inaccurate pseudo-labels.The overall framework is presented in Figure 2.For labeled data, the standard cross-entropy loss (CE Loss) is used.For unlabeled data, ŷu1 and ŷu2 are pseudo-labels given by Equation ( 8), and we use the aligned output distribution from one network to supervise the other one.For the over-expectation cross-entropy loss (O-E Loss) in Equation ( 10), p u = max(F(x u )) denotes the class probability corresponding to the pseudo-labels, t stands for the threshold given by Equation ( 9) and the O-E Masks are the thresholded binary masks to filter out pixels with low confidence in their pseudo-labels.We use class-wise DA to align marginal distributions for each class to better deal with imbalanced datasets.Here, M l and M u denote the labeled and the unlabeled marginal distributions.

Cross-Pseudo Supervision
Cross-pseudo supervision works by generating two views of the same input image with differently initialized segmentation networks.In this way, one network can discover the self-mistakes made by the other model [3,7].Following [7], we also use two parallel networks F 1 , F 2 with the same structure but different initialization for collaborative training.
To define our problem more formally, and without loss of generality, we view the medical image segmentation problem as a pixel-wise classification one since our focus in this paper is the cross-alignment of marginal distributions.More specifically, let D l = {(x l i , y l i )} m i=1 denote the labeled dataset and D u = {x u j } n j=1 the unlabeled dataset.Here, x l i and x u j are the image pixels, and y l i is the class label for x l i .The loss function on the labeled dataset can be written as: For the unlabeled data, we adopt the cross-pseudo supervision loss, which uses the pseudo-labels generated by one model to supervise the prediction of the other model, as follows: where ŷu1 j , ŷu2 j are the pseudo-labels generated by F 1 and F 2 , respectively.Here, pseudo-labels are the mode of prediction, i.e., the label derived from the most probable class.
Equation (2) shows that pseudo-labels determine our optimization objective.Therefore, most existing methods [7,6] focus on correcting or generating more accurate pseudo-labels to supervise unlabeled predictions.However, their methods neglect the fundamental challenges of learning from imbalanced data and the distribution mismatch between labeled and unlabeled data.In the next section, we introduce the proposed Co-DA that specifically addresses these problems.
Let us take a closer look at the pseudo-labels in Equation (2).We can conjecture that, ideally, if the pseudo-labels ŷu1 j , ŷu2 j follow the class distribution on labeled data, applying Equation (2) would already align the unlabeled class distribution to the labeled class distribution.However, due to class imbalance in medical image segmentation, the distribution of pseudo-labels could be biased toward majority classes.In this paper, we propose to address this problem by explicitly aligning the marginal distributions on unlabeled and labeled data after cross-pseudo supervision.In particular, our Co-DA considers K class-dependent distributions (where K is the number of classes) instead of a single overall empirical class distribution in naive DA.This approach allows us to capture and align finer class-wise statistics in the marginal distributions, as we describe in the next section.

Co-Distribution Alignment
In this section, we outline the proposed Co-Distribution Alignment in detail.Specifically, we first introduce an efficient method to estimate the class distribution on labeled data during training, and then discuss class-wise distribution estimation for capturing detailed statistics in labeled and unlabeled marginal distributions, followed by the distribution transformation to align the unlabeled distributions to labeled distributions.

Marginal Distribution Estimation
The vanilla DA in ReMixMatch [2] needs to use the empirical ground-truth class distribution for aligning model predictions on unlabeled data to it.However, estimating ground-truth class distribution becomes computationally expensive for segmentation tasks where a single image contains a large number of pixels, and using only a small amount of labeled data is inappropriate, as the estimation may not well represent the overall class distribution.To address this issue, we use the exponential moving average (EMA) to estimate the aggregated class predictions on labeled data, which can be written as: where T l and T l are the labeled distributions of the current iteration and the previous iteration, α determines that T l is the average prediction of last 1/(1 − α) iterations and Θ represents the parameters of the network.It should be noted that our work differs from ReMixMatch [2] in that we use EMA to estimate the class predictions on labeled data, which is more computationally efficient for dense prediction tasks.In addition, we note that the MLE of observations can be given by: where X is the entire dataset that includes both labeled and unlabeled data.Obviously, Θ is related to the likelihoods of both labeled and unlabeled data.Therefore, the labeled distribution estimated by Equation (3) considers both labeled and unlabeled data to estimate the labeled distribution and only costs manageable computations during training.

Class-Wise Distribution Estimation
An important limitation of the original DA is that class imbalance may cause the tail classes to diminish from the estimated distribution, as they only represent a small fraction of labeled data, ultimately leading to a biased distribution transformation.Unlike naive DA [2], Co-DA builds K independent distributions for each class, where K is the number of classes and each distribution is class-specific.Therefore, Co-DA is more robust to class imbalance due to the decoupled modeling process.Specifically, we maintain two matrices M l ∈ R K×K and M u ∈ R K×K for labeled and unlabeled distributions, respectively, in which row i is the distribution for the i-th class.For the i-th row of the labeled matrix M l i , we use Equation ( 3) to approximate the marginal distribution according to the ground-truth class label as follows: where is the indicator function.For the i-th row of the unlabeled matrix M u i , we update the marginal distribution by EMA according to the model prediction following ReMixMatch [2].The class membership of unlabeled data are obtained by the mode of prediction, i.e., the most probable class.However, we observe that the tail categories in the unlabeled distributions may be difficult to update due to their infrequent presence in a batch.In this case, we use the inverse transformation from labeled distribution to unlabeled distribution to update the unlabeled distribution.The overall update strategy of M u i can be written as follows: where ŷu = arg max F(x u ) and Mu i is the unlabeled distribution in the previous iteration.This update strategy stems from the original distribution alignment proposed in ReMixMatch [2], but here, we use it to estimate the unlabeled distribution when the unlabeled data from a certain class (usually the minority classes) are absent in a batch.In addition, we use EMA to maintain a stable update of the distributions required for our Co-DA in the training process.It should be noted that, in practice, there are two models (i.e., F 1 and F 2 ) in our method, and each model will be used to update their own M l and M u ; we omit their subscripts for notational simplicity.

Distribution Transformation
Following ReMixMatch, we align the distributions using coefficient transformation but with each class functioning independently.Besides, we use a temperature τ to scale the labeled distribution following CReST [45].Different from CReST, however, we choose an adaptive temperature according to the class-specific aggregated model prediction.More specifically, the temperature and the transformation of class i is given by: where [ŷ u = i] denotes that, according to the network, the i-th class is the most probable, and we therefore use the i-th row in M l and M u , i.e., M l i and M u i , for the coefficient transformation.In Equation ( 8), ⊗ and denote the element-wise product and division, respectively.In addition, F(x u ) and F (x u ) denote the network prediction for x u before and after distribution transformation.
We note that the temperature τ i further prevents the transformed distribution to be dominated by majority classes.Specifically, τ i → 0 makes the labeled distribution closer to uniform distribution, while a larger τ i results in a smaller shift.τ i not only makes the model more robust against noises in pseudo-labels in the early training stage, but also encourages the emergence of minority classes in the middle and late stages.
After distribution transformation, the network prediction after alignment, F (x u ), is then used for cross-pseudo supervision according to Equation (2), instead of F(x u ) as originally shown in Equation (2).

Over-Expectation Cross-Entropy Loss
In order to further reduce the negative impact of inaccurate pseudo-labels, we propose an over-expectation cross-entropy loss for learning from unlabeled data.Motivated by the definition of EMA, M l and M u can be regarded as the expectations of the labeled and unlabeled distributions, respectively.Intuitively, we can use the estimated aggregated model prediction for the i-th class as an adaptive threshold to filter out unlabeled samples below expectations to reduce noise.More specifically, the threshold t(i) for the i-th class is given by: As we will show in Section 4.5, this adaptive and dynamic threshold is consistently superior to different static threshold values and do not bring in additional hyperparameters.In this way, Equation ( 2) can be rewritten as: where p u i stands for the probability corresponding to the pseudo-labels, ŷu i stands for the class prediction of the pseudo-labels and we refer to our loss in Equation ( 10) as the over-expectation cross-entropy loss (O-E Loss).In a nutshell, this is an extension to the soft label cross-entropy loss that incorporates the threshold filtering in Equation ( 9).The logarithm comes from the cross-entropy between two probability distributions, and we refer readers to [35] for further background information.See Figure 2 for the O-E Masks 1 and 2 as an illustration for applying Equation (10)  Calculate the supervised loss L s (x l , y l ) according to Equation (1); 8: Calculate the unsupervised loss L u (x u , ŷu1 , ŷu2 ) according to Equation (10);

Experiments
In this section, we thoroughly verify the efficacy of the proposed Co-DA on three challenging public medical image segmentation datasets.Specifically, we first outline the experimental setup, including an introduction to the datasets, the evaluation metrics and our implementation details in Section 4.1, followed by quantitative and qualitative results on the three datasets in Section 4.2, Section 4.3 and Section 4.4, respectively.We also present ablation studies in Section 4.5 to demonstrate that our method provides a strong performance that is comparable to its fully supervised variant, and the individual components proposed in our method are all contributing to the performance of our model.

CaDIS
We first evaluate the performance of our method on a 2D medical image segmentation task.The publicly available CaDIS dataset [5,41] consists of 4671 frames from 25 surgical videos, which are collected by experts and annotated at the pixel level.Following [5], we consider three progressively more difficult semantic segmentation tasks on this dataset.Specifically, task 1 contains 8 classes, with 4 for anatomical structures, 1 for all instruments and 3 for other objects that appear in frames; task 2 contains 17 classes, which divides the single instrument classes in task 1 into 10 more specific classes of instruments; task 3 contains 25 classes, where instruments are further subdivided according to the handles and parts of certain instruments.
To thoroughly demonstrate the efficacy of the proposed method, we compare our method with state-of-the-art competing methods on all three tasks with varying levels of labeled data.This task poses some unique challenges for segmenting the anatomical structures, surgical instruments and other objects (i.e., surgical tapes, hands and eye retractors) simultaneously.In particular, classes are unevenly distributed, and some objects are either thin or small, or both.

Late Gadolinium Enhancement MRI
To further demonstrate the performance of our method on 3D medical image segmentation tasks, we also evaluate our method on the Late Gadolinium Enhancement MRI (LGE-MRI) dataset [48].This dataset is a collection of 154 3D LGE-MRIs acquired from the Left Atrial Segmentation Challenge, which contains data from 60 patients with atrial fibrillation prior to and post-ablation treatment.The goal is to perform Left Atrium (LA) segmentation, and the images have an isotropic resolution of 0.625 × 0.625 × 0.625 mm 3 .In particular, it is challenging to segment the LA in the top and the bottom slices, corresponding to the pulmonary veins and the mitral valve, respectively.

ACDC
In addition, we evaluate the performance of our method on another 3D medical image segmentation task using the ACDC (Automated Cardiac Diagnosis Challenge) 2017 dataset [1].This dataset consists of MRI images from 100 patients with expert annotations.Among these, 2 are used as the validation set, 20 are used as the test set and the rest are used as the training set.The ACDC dataset contains images from patients with normal cardiac anatomy as well as those with previous myocardial infarction, dilated cardiomyopathy, hypertrophic cardiomyopathy and an abnormal right ventricle.One unique challenge in this task lies in identifying the left ventrice, the myocardium and the right ventrice, which could all be very small at certain anatomical structures such as the apex.

Evaluation Metrics
This section presents the evaluation metrics employed to assess the effectiveness of the proposed approach.Firstly, we follow [29] and use the mean Intersection over Union (mIoU) to evaluate the CaDIS dataset, given as follows: where c is the number of classes, and TP, FP and FN denote the number of pixels that are true positives, false positives and false negatives, respectively.In addition, Dice score, Jaccard, the average surface distance (ASD) and the 95% Hausdorff Distance (95HD) are adopted to evaluate LGE-MRI [49].Also, methods are evaluated with Dice score and 95HD for ACDC 2017.Specifically, Dice, Jaccard and ASD are given as follows: where S(•) denotes the set of surface voxels, and the two sets A and B refer to the ground-truth and network prediction, respectively.|| • || denotes the L2 distance.For the Hausdorff distance, it could be written as: where h (S(A), S(B)) = max ||a − b|| , denoting the maximum distance of voxels in A to the nearest voxel in B. The 95% Hausdorff Distance is based on the 95th percentile of the surface distance above in order to eliminate the impact from a small number of outliers.

Implementation Details
For CaDIS, we use the publicly available split strategy [29] of 3550 frames for training and the remaining 1120 for validation.For 12%, 24% and 49% labeled data in our experiments, we randomly choose 424, 834 and 1729 labeled frames, respectively.In addition, we randomly crop the original image to 352 × 352 and augment the data in the same way as in [29], including random horizontal flipping, Gaussian noise and color jittering.To mitigate the effects of data imbalance, we also follow [29] and use Repeat Factor Sampling in the labeled set.For LGE-MRI, we are consistent with [49]'s pre-processing scheme, dividing the original 100 sheets of data into 80 for training and 20 for validation.All data are centrally cropped in the heart region and randomly cropped to 112 × 112 × 80.The data augmentation strategy used included random flips and rotations of 90, 180 and 270 degrees.For the ACDC 2017 dataset, we use random flip, rotation and scaling as data augmentation.
The proposed method is implemented with PyTorch [26].We use an nVIDIA GeForce RTX 3090 GPU for training.We choose to train CaDIS and ACDC datasets with EfficientNet-B3-based [39] Unet [31] and LGE-MRI with Vnet [23] as backbone architectures.Both networks are updated using the SGD optimizer with a momentum of 0.9 and an initial learning rate of 0.0001, with the learning rate divided by 0.001 1/epoch for each epoch, where epoch depends on the ratio of iterations to the number of labeled data in training, and the batch size is set to 8.For CaDIS, we set the number of iterations in one epoch to 10,000 when the ratio of labeled data are 12%, 20,000 when the ratio of labeled data are 24% and 50,000 when the ratio of labeled data are 49%; for training LGE-MRI, the number of iterations is set to 6000 uniformly; for ACDC, the number is 30,000 uniformly.

Results on CaDIS
In order to better verify the effectiveness of the proposed method, we compare it with the following state-of-the-art methods: URPC [21], UAMT [49], CLD [19] and CPS [7].Among these methods, URPC [21], UAMT [49] and CLD [19] are specifically designed for semi-supervised medical image segmentation, while CPS [7] is a generic semantic segmentation algorithm.We also experiment with a naive setup, i.e., to train the model with only labeled data, denoted as Baseline.As shown in Table 1, Co-DA outperforms other methods on different split settings.Specifically, in Task 1, our method is able to improve 0.1223, 0.1017 and 0.0329 on mean IoU with 12%, 24% and 49% labeled data, respectively.For Task 2 and Task 3, the corresponding improvements are 0.1304, 0.1228, 0.1832 and 0.0845, 0.1181, 0.1189.It should be noted that under Task 1 with 12% labeled data, the performance of our method is slightly lower than UAMT for semi-supervised medical image segmentation and CPS for generic image segmentation, which is probably due to the data distribution being relatively balanced when the amount of data and number of classes are small.However, as the problem gets more difficult when we have more imbalanced datasets with more classes, the performance gain obtained from our method becomes more prominent.[21] 0.6649 0.4449 0.3169 0.7486 0.5383 0.3886 0.8361 0.6328 0.5414 UAMT [49] 0.7223 0.2953 0.2180 0.7760 0.5288 0.4085 0.8811 0.6925 0.5815 CPS [7] 0.7222 0.4570 0.3525 0.8437 0.6562 0.4821 0.8874 0.7774 0.5837 CLD [19] 0.7120 0.4465 0.3056 0.8376 0.6474 0.4325 0.8857 0.7553 0.5999 Co-DA (Ours) 0.7196 0.5023 0.3629 0.8515 0.6641 0.5040 0.8932 0.7898 0.6475 In Figure 3, we show the qualitative results of Co-DA.The comparison between Co-DA and CLD [19] shows that Co-DA could obtain superior results, especially in the instrument regions.

Results on Late Gadolinium Enhancement MRI
Similar to our setup in the CaDIS experiment, the quantitative results of the experiments on Late Gadolinium Enhancement MRI (LGE-MRI) with 10% and 20% labeled data are shown in Table 2, where we also present the experimental results for all the competing algorithms.Our proposed Co-DA consistently provides a strong performance among all algorithms.With 10% labeled data, Co-DA boosts 21.78% and 26.54% on Dice and Jaccard over the baseline, and reduces 2.86 voxels and 10.7 voxels on ASD and 95HD, respectively.With 20% labeled data, Co-DA consistently provides the best performance, improving 25.3% and 29.48% on Dice and Jaccard over the baseline, and reduces 5.54 voxels and 16.08 voxels on ASD and 95HD, respectively.We note that our method demonstrates the strongest performance on Dice score and Jaccard under all settings.Furthermore, we present the trends in mIoU with 20% data  Figure 5 shows some examples of the segmentation results on LGE-MRI, using the proposed Co-DA and another state-of-the-art method, CLD [19].CLD tends to have difficulty in accurately segmenting the pathological regions and will partially misjudge the healthy regions, while our segmentation results of the pathological regions are much closer to the ground-truth.In addition, our method is less likely to generate spurious artefacts commonly found in the results obtained with CLD, showcasing the strong regularization ability of class-wise distribution alignment.

Results on ACDC
According to Table 3, the performance of our method on the ACDC dataset is better than or comparable to other competing algorithms, which shows that our method is also effective in this segmentation task.With 10% labeled data, our method is able to outperform the baseline with a 12.35% margin on Dice and a 6.18 voxels improvement on 95HD.With 20% labeled data, these performance boosts are 10.3% and 4.63 voxels, respectively.It is clear that our method provides the best performance in terms of 95HD under both settings.However, our method, along with other methods for semi-supervised medical image segmentation, displays a slightly poorer performance than CPS on Dice score, indicating the excellent performance of CPS and showing that distribution alignment for pseudo-labels   does not work well enough on the ACDC dataset, which we will continue to explore in future work.Although our method does not provide the best Dice score, the performance in terms of 95HD is better than all other algorithms, demonstrating that Co-DA is able to provide high quality segmentation with accurate boundary delineation.We present example segmentation results of CLD and our proposed Co-DA in Figure 6.

Ablation Studies
In this section, we perform additional experiments to verify the strong learning capacity Co-DA as compared to its fully supervised variant.In addition, we isolate the different components of our method to ensure that they all contribute to the final performance.Finally, we look into the efficacy of the dynamic threshold for the over-expectation cross-entropy loss and compare examples of pseudo-labels generated with and without Co-DA.
Firstly, we compare Co-DA with the fully supervised settings on both CaDIS and LGE-MRI and present the results in Figure 7.In many cases, the performance of Co-DA is very close to the fully supervised setting, demonstrating the outstanding efficacy of class-wise distribution alignment.Specifically, with 49% labeled data, Co-DA provides only slightly lower mIoU on CaDIS Tasks 1 and 2 as compared to the fully supervised setting.It even marginally outperforms the fully supervised case on Task 3, which means the refined pseudo-labels have a superior quality.
Figure 7: Comparing the baseline, Co-DA, and the fully supervised method on the CaDIS dataset in terms of mIoU.In some cases, Co-DA performs comparably to the fully supervised method, demonstrating its strong learning capacity.
Secondly, to ensure the complementary effect of proposed components in our method, i.e., the co-distribution alignment and the over-expectation cross-entropy loss, we conduct a set of ablation experiments on CaDIS, and the results are presented in Table 4.The third row demonstrates that the over-expectation cross-entropy loss reduces the noise in pseudo-labels by dynamically varying the threshold to filter out low-confidence pixels, achieving a significant improvement of 0.0488 on Task 3 for mean IoU over the baseline.The fourth row shows that the proposed Co-DA can bring a gain of 0.1303 on Task 3, while there is a slight decrease in performance on Task 2. Empirically, we conjecture the reason is that Task 1 and Task 3 have more severe class imbalance than Task 2 [5].Considering the overall stability of the proposed method, we apply the two newly designed modules to co-training simultaneously, obtaining improvements of 0.0058, 0.0124 and 0.0748 on the three tasks over the baseline, respectively.Finally, according to Figure 9, different predefined thresholds for the over-expectation loss will also lead to different levels of performance.Yet, a dynamic threshold as provided by Equation ( 9) is consistently superior, as it is able to automatically select a suitable gating value for retaining reliable pseudo-labels.
Figure 9: A performance comparison of static thresholds versus the dynamic threshold that our method adopts according to Equation 9on Task 1 of the CaDIS dataset with 24% labeled data.It is clear that the dynamic threshold provides superior mean IoU.

Conclusions
In this paper, we propose a novel Co-Distribution Alignment (Co-DA) approach to partially labeled medical image segmentation tasks with imbalanced class distributions.Our key idea involves a class-dependent alignment between labeled distributions and unlabeled marginal distributions that is suitable for dense prediction tasks.Specifically, Co-DA simplifies the estimation of labeled distributions, extends the original DA to class-wise distribution alignment with cross-supervision and adopts adaptive temperature scaling for labeled distributions to avoid highly imbalanced estimations.In addition, we propose an over-expectation cross-entropy loss to reduce noises in pseudo-labels.Extensive experiments, including abation studies, on three publicly available datasets demonstrate the consistently superior learning capacity of our approach.

Figure 1 :
Figure 1: Illustration of a comparison of Co-Distribution Alignment (Co-DA) with Distribution Alignment (DA) and co-training.(a) is the DA in ReMixMatch [2], (b) is co-training [7], (c) is the proposed Co-DA.M1 and M2 here stand for two differently initialized networks.The dark blue circles and squares denote the output and the pseudo-labels of M1.The light blue circles and squares denote the output and the pseudo-labels of M2.

Figure 2 :
Figure2: Overview of Co-DA for semi-supervised medical image segmentation, which consists of a stream for labeled data and a stream for unlabeled data.F 1 and F 2 are two parallel networks with the same architecture but different initializations.For labeled data, the standard cross-entropy loss (CE Loss) is used.For unlabeled data, ŷu1 and ŷu2 are pseudo-labels given by Equation (8), and we use the aligned output distribution from one network to supervise the other one.For the over-expectation cross-entropy loss (O-E Loss) in Equation(10), p u = max(F(x u )) denotes the class probability corresponding to the pseudo-labels, t stands for the threshold given by Equation (9) and the O-E Masks are the thresholded binary masks to filter out pixels with low confidence in their pseudo-labels.We use class-wise DA to align marginal distributions for each class to better deal with imbalanced datasets.Here, M l and M u denote the labeled and the unlabeled marginal distributions.

Figure 3 :
Figure 3: Comparison of the proposed Co-DA and another state-of-the-art method, CLD [19], on CaDIS with 49% labeled data.(a) is the input to the networks, (b) is the ground-truth of the corresponding input, (c) is the output of CLD, and (d) is the output of Co-DA.

Figure 4 :
Figure 4: Trends in mIoU on the LGE-MRI dataset with 20% labeled data.It is clear that Co-DA not only provides the best performance but also trains more quickly than most other methods, requiring a smaller number of training iterations.Also, the performance is stable once peaked.

Figure 5 :
Figure 5: Comparison of the proposed Co-DA and another state-of-the-art method, CLD [19] on 2D slices of MRI on LGE-MRI with 20% labeled data, where (a) is the input to the networks, (b) is the corresponding grouth-truth, (c) is the output of CLD (d) is the output of Co-DA.

Figure 6 :
Figure 6: Comparison of the proposed Co-DA and another state-of-the-art method, CLD [19], on the ACDC dataset with 10% labeled data, where (a) is the input to the networks, (b) is the corresponding ground-truth, (c) is the output of CLD and (d) is the ouput of Co-DA.

Table 4 :
Results from our ablation study on CaDIS.Performance shown in mean IoU.O-E denotes the proposed over-expectation cross entropy loss.Co-training DA O-E Co-DA 49% Labeled Data Task1 Task2 Task3 0.8874 0.7774 0.5837 0.8893 0.7718 0.6438 0.8903 0.7824 0.6325 0.8907 0.7681 0.7140 0.8932 0.7898 0.6585We also visualize examples of the pseudo-labels generated by the networks with and without Co-DA, and the results are shown in Figure8.Clearly, Co-DA is able to improve the quality of pseudo-labels, resulting in an overall accuracy improvement and a reduction of the negative impact from incorrect pseudo-labels.

Figure 8 :
Figure 8: Visualization of pseudo-labels generated by Co-DA.(a) is the input image, (b) is the Ground-Truth, (c) is pseudo-labels generated by the network without Co-DA, and (d) is the pseudo-labels refined by Co-DA.

Table 1 :
Mean IoU results on CaDIS.Bold numbers represent the best performance.

Table 2 :
Experimental results on LGE-MRI.Bold numbers represent the best performance.

Table 3 :
Experimental results on ACDC.Bold numbers represent the best performance.