Learning Deep Representations of Cardiac Structures for 4D Cine MRI Image Segmentation through Semi-Supervised Learning

Learning good data representations for medical imaging tasks ensures the preservation of relevant information and the removal of irrelevant information from the data to improve the interpretability of the learned features. In this paper, we propose a semi-supervised model—namely, combine-all in semi-supervised learning (CqSL)—to demonstrate the power of a simple combination of a disentanglement block, variational autoencoder (VAE), generative adversarial network (GAN), and a conditioning layer-based reconstructor for performing two important tasks in medical imaging: segmentation and reconstruction. Our work is motivated by the recent progress in image segmentation using semi-supervised learning (SSL), which has shown good results with limited labeled data and large amounts of unlabeled data. A disentanglement block decomposes an input image into a domain-invariant spatial factor and a domain-specific non-spatial factor. We assume that medical images acquired using multiple scanners (different domain information) share a common spatial space but differ in non-spatial space (intensities, contrast, etc.). Hence, we utilize our spatial information to generate segmentation masks from unlabeled datasets using a generative adversarial network (GAN). Finally, to reconstruct the original image, our conditioning layer-based reconstruction block recombines spatial information with random non-spatial information sampled from the generative models. Our ablation study demonstrates the benefits of disentanglement in holding domain-invariant (spatial) as well as domain-specific (non-spatial) information with high accuracy. We further apply a structured L2 similarity (SL2SIM) loss along with a mutual information minimizer (MIM) to improve the adversarially trained generative models for better reconstruction. 
Experimental results achieved on the STACOM 2017 ACDC cine cardiac magnetic resonance (MR) dataset suggest that our proposed (CqSL) model outperforms fully supervised and semi-supervised models, achieving 83.2% accuracy even when using only 1% labeled data. We hypothesize that our proposed model has the potential to become an efficient semantic segmentation tool that may be used for domain adaptation in data-limited medical imaging scenarios, where annotations are expensive. Code and experimental configurations will be made publicly available.


Background and Problem Statement
The emerging success of deep convolutional neural networks (CNNs) has rendered them the de facto model in solving high-level computer vision tasks [1][2][3]. However, such approaches mostly rely on large amounts of annotated data for training, the acquisition of which is expensive and laborious, especially for medical imaging/diagnostic radiology data. To address the need for high performance, there has been a growing trend in using a limited amount of annotated data along with an abundance of unlabeled data in a semi-supervised learning (SSL) setting.
The recent dominant body of research proposing SSL methods in deep learning features various approaches, including an auxiliary loss term defined on un-annotated data (consistency regularization) [4,5], adversarial networks [6], generating pseudo-labels [7,8] from model predictions on weakly augmented unannotated data, self-training [9,10], adversarial learning [11], and domain adaptation [12]. Here we review the latest accomplishments in domain adaptation, semi-supervised learning, and interpretable representation learning via disentanglement, and briefly discuss their outstanding limitations.
Variational autoencoder-based models: There have been several recent works involving disentangled learning with variational autoencoders (VAEs) [24,40,41]. In contrast to these previous works, we demonstrate the use of a VAE for disentangled representation learning by sampling the sentiency code to separate the domain-specific information from the domain-invariant latent code.

Overview of the Proposed Method
To further address some of the shortcomings associated with existing methods, our efforts focus on learning meaningful spatial features by utilizing a disentangler with a mutual information minimizer (MIM) to improve the adversarially trained generative models and, in turn, the semi-supervised segmentation and reconstruction results.
Our proposed method builds on several recent and key research findings in the fields of generative models, semi-supervised learning, and representation learning via disentanglement. We believe that the proposed framework's reliance on as little as 1% labeled data for training, in concert with the high segmentation accuracy achieved, comparable to fully or semi-supervised models, renders the proposed work an attractive solution for medical image segmentation, where vast expert-annotated data are expensive and often difficult to access.
We approach this problem using a method based on disentangled representations that utilizes data from multiple scanners with varying intensities and contrast (Figure 1). Our method is intended to address multi-scanner unlabeled-data issues, such as intensity differences and a lack of sufficient annotated data. Learning good data representations for medical imaging tasks ensures the preservation of relevant information and the removal of irrelevant information from the data to improve the interpretability of the learned features. Our model disentangles the input image into a spatial and a non-spatial space. The spatial features are represented as categorical feature maps, with each category corresponding to input pixels that are spatially similar and belong to the same organ part. This semantic similarity helps generalize the learned anatomical representation to any modality from different scanners. Furthermore, the non-spatial features capture the image's global intensity information, which aids the renderer in painting the anatomy in the reconstructed image. Finally, because annotating data is time-consuming and expensive, the ability to learn this decomposition through disentanglement using a small number of labels is critical in medical image analysis.
In light of these needs, here we propose a semi-supervised (CqSL) model for learning disentangled representations that combines recent developments in semi-supervised learning, generative models, and adversarial learning. We aim to factorize the representation of an image pair into two parts: a shared representation that captures the common information between images and an exclusive representation that contains the specific information of each image. Furthermore, in order to achieve representation disentanglement, we propose to minimize the mutual information between the shared and exclusive representations. Moreover, we use feature-wise linear modulation (FiLM) [38] to distinguish the domain-invariant information from the domain-specific information, as well as a spatially adaptive normalization (SPADE) [39]-based decoder to guide the synthesis of more texture information and restrain the posterior collapse of the VAE.
To illustrate its adequacy, our model is applied to two of the most critical tasks in medical imaging, segmentation of cardiac structures and reconstruction of the original image, and both tasks are handled by the same model. Our model leverages a large amount of unannotated data from the ACDC (https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html, accessed on 2 October 2021) dataset to learn interpretable representations through judicious choices of common factors that serve as strong prior knowledge for the more complicated problem: the segmentation of cardiac structures. Figure 2 shows a simplified data view of our proposed model.

Contributions
Our proposed work makes several contributions summarized as follows:

1. We combine recent developments in disentangled representation learning with strong prior knowledge about medical imaging data that features a decomposition into "skeleton (spatial)" and "sentiency (non-spatial)", to ensure that the spatial information is not mixed up with the non-spatial information.

2. We alter the usual cross-entropy loss to down-weight the loss applied to well-classified samples in order to overcome the foreground-background class imbalance problem. Specifically, we exploit a novel supervised loss, the weighted-soft-background-focal (WSBF) loss, which focuses the training on a set of hard examples and can differentiate between easy and hard examples.

3. We employ both qualitative and quantitative tests to evaluate the usefulness of our framework, which show that our model outperformed fully supervised methods, even when using only 1% labeled data for training.
The paper is organized as follows: Section 1 establishes the general background and motivation of the work, reviews the related literature on latest developments in the field of domain adaptation, semi-supervised learning and representation learning, and provides an overview of the proposed work; Section 2 describes our proposed methodology; Section 3 presents our quantitative and qualitative results achieved using our proposed method for both image segmentation and reconstruction, along with the associated ablation studies; Section 4 concludes the paper with a summary of our contributions and promising future research directions.

CqSL Model Overview
We propose a model that combines variational generative learning, adversarial learning, and disentangled representation learning in a semi-supervised learning scheme, suited for domain-adapted segmentation as well as reconstruction.
We define the learning task as follows: given an (unknown) data distribution $p(x, y)$ over images and segmentation masks, we define a source domain having a labeled training set $D_{\mathcal{L}}$ and an unlabeled set $D_{U\mathcal{L}}$ with $n_{ul}$ unlabeled examples, sampled as independent, identically distributed variables from $p(x, y)$ and $p(x)$, respectively. Empirically, we want to minimize the target risk $\epsilon_t(\phi, \theta) = \min_{\phi, \theta} \mathcal{L}_{\mathcal{L}}(D_{\mathcal{L}}, \phi, \theta) + \gamma \mathcal{L}_{U\mathcal{L}}(D_{U\mathcal{L}}, \phi, \theta)$, where $\mathcal{L}_{\mathcal{L}}$ is the supervised loss for segmentation, $\mathcal{L}_{U\mathcal{L}}$ is the unsupervised loss defined on unlabeled images, and $(\phi, \theta)$ denote the learnable parameters of the overall network.
We propose to solve the task by learning domain-specific and domain-invariant features that are discriminative for the segmentor and reconstructor. Figure 3 shows the proposed model, composed of five components: (1) a disentanglement component, (2) a disentangled variational autoencoder (DVAE), (3) a mask segmentor identifier (SI), (4) a mask discriminator identifier (DI), and (5) a reconstructor R.
The disentangler D (Figure 3a) is designed to factorize the representation of an image pair into two parts: a shared spatial representation (skeleton, $SK_e$) that captures the common information between images and an exclusive non-spatial representation (sentiency, $S_e$) that contains the specific information of each image. The skeleton block $SK_e$ is a modified U-Net++ [42] type architecture (EPU-Net++) (Figure 4 and Section 2.1.1) and is responsible for capturing the domain-invariant features $f_{SK}$. The sentiency block $S_e$ is a DVAE (Figure 3b) type architecture, which takes both the input image and the domain-invariant features $f_{SK}$ as input to map the domain-specific features $f_{SE}$ using the reparameterization trick [43].
The reconstruction block consists of two decoders: the SPADE-based decoder takes the $f_{SE}$ feature from the sentiency block and proceeds directly to the reconstructor R (Figure 3d), while the FiLM-based decoder works as another disentangler, which untangles a segmentor identifier (SI) (Figure 3c), used for segmentation, and extracted features, which then proceed directly to the reconstructor R. The reconstructor R aims to recover the original image from both $(f_{SK}, f_{SE})$. A mutual information minimizer (Figure 3a) is applied between $SK_e$ and $S_e$ to enhance the disentanglement. A supervised trainer is trained on the labeled data to predict the segmentation mask distribution by optimizing a supervised loss. An unsupervised trainer is trained on the unlabeled data, optimizing unsupervised losses (Algorithm 1 specifies the overall training procedure). Both the unsupervised and supervised trainers share the same blocks, as mentioned above.

Disentanglement

Referring to Figure 3a, the disentangler block factorizes the image features into spatial (skeleton/physique) features, as well as non-spatial (sentiency) features that carry residual information. The skeleton block is a modified U-Net type architecture, EvoNorm-Projection-UNet++ (EPU-Net++), as shown in Figure 4. We attach eight different decoders at the common bottleneck layer of EPU-Net++. Each decoder captures bottleneck features from 2D cropped images and transforms them into feature maps consisting of a number of binary channels, which are then combined to form the eight most effective channels: $x_{ST} \in \{0, 1\}^{h \times w \times c} = \sum_{i=1}^{8} f_{SK}^{i}$. These feature maps are responsible for capturing the domain-invariant features and contain the cardiac structures (the myocardium, the left and the right ventricle), effective for segmentation, and some surrounding structures, effective for reconstruction (Figure 5).
We use a separate neural network for capturing the sentiency, i.e., domain-specific, information. We combine the cropped image and the domain-invariant features and penalize the deviation of the latent features from the prior distribution via the Kullback-Leibler divergence, using a VAE architecture (Figure 3b) with the objective function in Equation (1). A VAE learns a low-dimensional latent space such that the acquired latent representations fit a predetermined prior distribution, an isotropic multivariate Gaussian $p(z) = \mathcal{N}(0, 1)$. A VAE comprises an encoder and a decoder. Given an input, the encoder predicts the parameters of the Gaussian distribution. To enable learning through backpropagation, this distribution is then sampled using the reparameterization technique, and the resulting sample is passed through the decoder to reconstruct the input.
We use disentangled features as the prior distribution in the VAE (Equation (1)) to remove class-irrelevant features (e.g., background pixels) and ensure that domain-invariant features are well-disentangled from class-specific features, because an image-only prior would simply align the latent features to a normal distribution.
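The reparameterization technique mentioned above can be sketched in a few lines of NumPy; the function names are ours, and a trained encoder would predict `mu` and `log_var` rather than the constants used here:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the
    sampling step differentiable with respect to (mu, log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = np.zeros(4)
log_var = np.zeros(4)
z = reparameterize(mu, log_var)
print(z.shape, kl_to_standard_normal(mu, log_var))  # KL is 0 for mu=0, var=1
```

Because the randomness is isolated in `eps`, gradients can flow through `mu` and `log_var` during backpropagation, which is precisely why the trick is needed.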

Mutual Information Minimizer

To better exploit the disentanglement, we add a regularization term based on mutual information (MI), denoted MIM, which measures the "amount of information" obtained about one random variable $X$ through knowledge of another random variable $Y$ [44]. In this paper, we adopt the mutual information neural estimator (MINE) [45] to estimate $MI(f_{SK}, f_{SE})$, where $(\alpha, \beta)$ are sampled from the joint distribution of $(f_{SK}, f_{SE})$ and $\beta'$ is sampled from the marginal distribution.
The mutual information can be expressed as the difference of two entropy terms, $MI(X; Y) = H(X) - H(X|Y)$; we seek to minimize the MI between the domain-invariant and domain-specific features $(f_{SK}, f_{SE})$, under the assumption that the information content does not vary much within a domain (Figure 3a).
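MINE estimates MI with the Donsker-Varadhan lower bound, evaluated on paired (joint) and shuffled (marginal) batches. The NumPy sketch below illustrates only the bound itself; the fixed random bilinear score `T` is our stand-in for the trained statistics network that MINE actually optimizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def mine_lower_bound(f_sk, f_se, T):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].
    f_sk, f_se: (n, d) paired feature batches; T: statistics network."""
    joint = T(f_sk, f_se)                        # paired samples
    shuffled = f_se[rng.permutation(len(f_se))]  # break the pairing -> marginals
    marginal = T(f_sk, shuffled)
    return joint.mean() - np.log(np.exp(marginal).mean() + 1e-12)

# stand-in "network": a fixed random bilinear score (a trained MLP in practice)
W = rng.standard_normal((3, 3))
T = lambda a, b: np.einsum('ni,ij,nj->n', a, W, b)

x = rng.standard_normal((256, 3))
print(mine_lower_bound(x, x + 0.1 * rng.standard_normal((256, 3)), T))
print(mine_lower_bound(x, rng.standard_normal((256, 3)), T))
```

In the full method, maximizing this bound over `T` tightens the MI estimate, while the disentangler is trained to drive the estimate down.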

Segmentation

The labeled data are trained by minimizing our supervised weighted-soft-background-focal (WSBF) loss, in which $\alpha_0$ and $\alpha_1$ are designed to account for class imbalance and are treated as hyperparameters, and the term $|y - \hat{y}|^{\gamma}$ is used to down-weight examples with backgrounds, where $\gamma$ varies in the range $[1, 3]$. The term $CE(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$ denotes the cross-entropy loss.
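As an illustration of the focal down-weighting idea behind the WSBF loss, here is a simplified NumPy sketch; the function name and the default values of alpha are our assumptions, not values from the paper:

```python
import numpy as np

def wsbf_like_loss(y, y_hat, alpha0=0.25, alpha1=0.75, gamma=2.0, eps=1e-7):
    """Focal-style weighting: well-classified pixels get a small
    |y - y_hat|**gamma factor; alpha0/alpha1 re-balance background/foreground."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    ce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # pixel-wise CE
    alpha = np.where(y == 1, alpha1, alpha0)                 # class re-weighting
    return np.mean(alpha * np.abs(y - y_hat) ** gamma * ce)

y = np.array([1.0, 1.0, 0.0, 0.0])
easy = np.array([0.95, 0.9, 0.1, 0.05])   # confident, correct predictions
hard = np.array([0.55, 0.4, 0.6, 0.45])   # uncertain predictions
print(wsbf_like_loss(y, easy) < wsbf_like_loss(y, hard))  # hard examples dominate
```

The `|y - y_hat|**gamma` factor is what lets the loss differentiate easy from hard examples: confident correct predictions contribute almost nothing, so training focuses on the hard set.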
On the other hand, the data with no corresponding segmentation masks are trained by minimizing the unsupervised loss via a KL divergence based on the least-squares GAN [46]. However, since the least-squares loss is not sufficiently robust, we introduce a new divergence loss function by incorporating it into a Geman-McClure model [47] fashion, called the adversarial-Geman-McClure (adv-GM) loss, between the ground truth real mask $y_l$ and the prediction on unlabeled data $\hat{y}_{ul}$, where $\beta$ is a scale factor that varies in the range $[0, 1]$; we set $\beta = 0.5$ in our experiments.
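The appeal of the Geman-McClure model is that its penalty saturates for large residuals, so outlier predictions cannot dominate the adversarial objective the way they do under a plain squared loss. A NumPy sketch of one plausible form (our reading, not the paper's exact equation) applied to least-squares GAN residuals:

```python
import numpy as np

def geman_mcclure(r, beta=0.5):
    """Geman-McClure robust penalty: grows like r**2 near zero but
    saturates toward 1 for large residuals, limiting outlier influence."""
    return (r ** 2) / (r ** 2 + beta ** 2)

def adv_gm_loss(d_real, d_fake, beta=0.5):
    """Least-squares GAN residuals (D(real) - 1 and D(fake)) passed
    through the Geman-McClure penalty instead of a plain square."""
    return np.mean(geman_mcclure(d_real - 1.0, beta)) + \
           np.mean(geman_mcclure(d_fake, beta))

d_real = np.array([0.9, 0.8, 1.1])   # discriminator scores on real masks
d_fake = np.array([0.2, 0.1, 0.3])   # discriminator scores on predictions
print(adv_gm_loss(d_real, d_fake))
print(geman_mcclure(np.array([10.0]))[0] < 1.0)  # True: the penalty saturates
```

Compare with the plain least-squares loss, where a residual of 10 would contribute 100; here it contributes just under 1.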

Image Reconstruction

To better capture the anatomical shape and the intensity information in the synthetic image, we propose a two-branch reconstruction architecture featuring two separate decoders: one conditioned with FiLM [38] and the other with SPADE [39] (Figure 6a); their outputs are then concatenated to produce a realistic image. The FiLM decoder consists of multiple FiLM layers, a gamma-beta predictor, and convolutional layers with 3 × 3 kernels and (8, 8, 8, 8, 1) channels with stride 1. Each convolution layer is followed by a batch normalization layer and a Leaky-ReLU layer.
To better retain the non-spatial information in the MR image, we integrate the shape knowledge into the idea of SPADE [39] and form a shape-aware normalization layer (see Figure 6). SPADE first normalizes the input feature $F_{in}$ with a scale $\alpha$ and a shift $\mu$ learned from the sampled $z$ using an instance-normalization (InstanceNorm) layer, inspired by [38], and then denormalizes it based on the spatial representation $f_{SK}$ through learnable parameters $\gamma$ and $\beta$. $f_{SK}$ is interpolated to match the texture dimension of the $z$ sampled from the sentiency encoder and used as a semantic mask for SPADE, where $F_{in}$ and $F_{out}$ denote the input and output feature maps, and $\gamma$ and $\beta$ are learned from $f_{SK}$ by three Conv layers. Thus, the learned shape information precludes washing away the anatomical information, which encourages the image synthesis to be more accurate. The first convolution layer inside the SPADE block (Figure 6) encodes the interpolated $f_{SK}$, and the other two convolution layers learn the spatial tensors $\gamma$ and $\beta$. Simultaneously, an instance normalization layer is applied to the intermediate feature map, which is then modulated by the scale and shift parameters $\gamma$ and $\beta$ learned from the sampled $z$ to produce the output. Finally, the outputs of the two decoders are re-entangled in order to reconstruct an image.
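The core SPADE operation, instance normalization followed by spatially varying denormalization, can be sketched as follows; here the per-pixel `gamma_map`/`beta_map` are passed in precomputed, whereas in the actual layer they are produced by convolutions over the interpolated skeleton features:

```python
import numpy as np

def spade_like_norm(F_in, gamma_map, beta_map, eps=1e-5):
    """Instance-normalize F_in (per channel, over spatial dims), then
    denormalize with spatially varying gamma/beta maps, which in SPADE
    are predicted from the skeleton features f_SK by conv layers."""
    mu = F_in.mean(axis=(1, 2), keepdims=True)   # per-channel mean (C,1,1)
    var = F_in.var(axis=(1, 2), keepdims=True)   # per-channel variance
    normalized = (F_in - mu) / np.sqrt(var + eps)
    return gamma_map * normalized + beta_map     # pixel-wise modulation

C, H, W = 2, 4, 4
F_in = np.random.default_rng(0).standard_normal((C, H, W))
gamma = np.ones((C, H, W))   # identity modulation; conv(f_SK) in practice
beta = np.zeros((C, H, W))
out = spade_like_norm(F_in, gamma, beta)
print(out.shape)  # (2, 4, 4)
```

Because `gamma_map` and `beta_map` vary per pixel, the layer can re-inject the anatomical layout that a plain (spatially uniform) normalization would wash away.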

Objective Functions
The training objective function consists of multiple losses for labeled and unlabeled data, each weighted by a scalar term $\lambda_t$ for the loss of type $t$: $\mathcal{L} = \lambda_{vae} \mathcal{L}_{vae} + \lambda_{seg} \mathcal{L}_{seg} + \lambda_{adv\text{-}GM} \mathcal{L}_{adv\text{-}GM} + \lambda_{SL_2SIM} \mathcal{L}_{SL_2SIM} + \lambda_{MIM} \mathcal{L}_{MIM}$ (6). In this paper, we empirically set the weights as $\lambda_{vae} = 0.01$, $\lambda_{seg} = 10$, $\lambda_{adv\text{-}GM} = 10$, $\lambda_{SL_2SIM} = 0.01$, $\lambda_{MIM} = 1$.
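As a concrete illustration of the weighted sum in Equation (6), using the reported weights (the individual loss values below are hypothetical placeholders):

```python
# weights as reported in the paper
LAMBDAS = {"vae": 0.01, "seg": 10.0, "adv_gm": 10.0, "sl2sim": 0.01, "mim": 1.0}

def total_loss(losses, lambdas=LAMBDAS):
    """Weighted sum of the per-component losses (Equation (6))."""
    return sum(lambdas[k] * losses[k] for k in lambdas)

# hypothetical per-component loss values for one training step
losses = {"vae": 0.5, "seg": 0.2, "adv_gm": 0.1, "sl2sim": 0.8, "mim": 0.05}
print(total_loss(losses))  # 0.01*0.5 + 10*0.2 + 10*0.1 + 0.01*0.8 + 1*0.05 ≈ 3.063
```

The large weights on the segmentation and adversarial terms relative to the VAE and SL2SIM terms reflect that segmentation quality drives the optimization, with reconstruction acting as regularization.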

Segmentation Loss

Since the model is trained on both labeled and unlabeled data, the segmentation loss $\mathcal{L}_{seg}$ includes both supervised and unsupervised losses.

Supervised Loss: Our supervised cost is based on the combination of the two following functions: (1) the weighted soft focal loss, and (2) the background focal dice loss mentioned in Equation (3). Similarly, for the unlabeled data, the adversarial (adv-GM) loss is used.

VAE Loss: For the smooth texture detail of the input data, the VAE learns factorized representations to optimize a KL-divergence loss, given an image $x_i^{ul}$ and its decomposed skeleton feature $f_{SK}$ (Equation (1)).

Reconstruction Loss

We adopt a novel reconstruction loss, a combination of the structural similarity (SSIM) and $L_2$ losses ($SL_2SIM$), in order to enforce the similarity between the recovered image and the original image and to better learn the distribution of images.

$SL_2SIM$ Loss: Since image intensities vary across imaging scanners, there is a high chance that the generative model will tend toward mode collapse. The structural $L_2$ similarity ($SL_2SIM$) loss provides a similarity measure between the input image and the reconstructed image based on light-dark variance, contrast, and structural similarity. The concatenated FiLM and SPADE decoders learn the parameters to reconstruct the input image using a novel combination of the structured similarity loss and the $L_2$ loss. The reconstruction loss takes the same form for labeled and for unlabeled data, where $SL_2SIM$ is the structure similarity index term and $\alpha$ is a regularization term.

Dataset

The cardiac cine MR images used in this work are from the ACDC dataset [48]. The images were acquired over a 6-year period using two MRI scanners of different magnetic field strengths (1.5 T and 3.0 T). The images were acquired using the SSFP sequence with a spatial resolution of 1.37 to 1.68 mm²/pixel and 28 to 40 frames per cardiac cycle. We split the dataset into three sets: training (70), validation (15), and test (15).
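Returning to the $SL_2SIM$ loss described above, a minimal NumPy sketch of one plausible form, a convex combination of a (global, single-window) SSIM term and the mean $L_2$ error; the constants and the weighting are our assumptions:

```python
import numpy as np

def sl2sim_loss(x, y, alpha=0.5, c1=1e-4, c2=9e-4):
    """Hypothetical SL2SIM: alpha-weighted sum of (1 - global SSIM) and
    the mean L2 error; images assumed normalized to [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))  # luminance/contrast/structure
    l2 = np.mean((x - y) ** 2)
    return alpha * (1.0 - ssim) + (1.0 - alpha) * l2

x = np.random.default_rng(0).random((32, 32))
print(round(sl2sim_loss(x, x), 6))  # 0.0 for a perfect reconstruction
```

A production implementation would compute SSIM over local windows (as in the original SSIM definition) rather than globally; the combination with $L_2$ is what penalizes both structural and intensity-level deviations.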

Implementation Details
Input: All the cine cardiac images employed slice-wise normalization in the range [0, 1] by subtracting the mean slice intensity from each pixel intensity, then dividing by the difference between the maximum and minimum slice intensities. All images were resampled to 1.37 mm²/pixel. Images were cropped to 192 × 192 × 1 pixels before being fed to the models. We applied data augmentation on-the-fly during training, as shown in Figure 7, including random rotations up to 90 degrees, random zooms up to 20%, random horizontal shifts up to 20%, random horizontal and/or vertical flips, and noise addition.
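The slice-wise normalization and cropping described above can be sketched as follows (the center-crop-with-padding behavior for undersized slices is our assumption):

```python
import numpy as np

def normalize_slice(img, eps=1e-8):
    """Slice-wise normalization as described: subtract the mean slice
    intensity, then divide by the slice intensity range (max - min)."""
    return (img - img.mean()) / (img.max() - img.min() + eps)

def center_crop(img, size=192):
    """Center-crop a 2D slice to size x size pixels (pad first if smaller)."""
    h, w = img.shape
    pad_h, pad_w = max(size - h, 0), max(size - w, 0)
    img = np.pad(img, ((pad_h // 2, pad_h - pad_h // 2),
                       (pad_w // 2, pad_w - pad_w // 2)))
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

slice_ = np.random.default_rng(0).random((224, 200)) * 1000.0  # raw intensities
out = center_crop(normalize_slice(slice_))
print(out.shape)  # (192, 192)
```

Range-based normalization keeps the mapping robust to the different intensity scales of the 1.5 T and 3.0 T scanners.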
Baselines Architecture: As the disentangled encoder in the skeletal block, we use a modified U-Net-like architecture, EPU-Net++, and as a sentiency encoder, we use VAE. As the reconstruction block, we use FiLM-and SPADE-based decoder as used in [49].
Generator-Discriminator Network: Our segmentation generator network consists of 3 convolution layers with 3 × 3 kernels and {64, 64, 1} channels with stride 1. Each convolution layer is followed by a batch normalization [50] layer and a Leaky-ReLU [51] activation, except for the last layer. We use a structure similar to DCGAN [52] for the discriminator network.

EvoNorm-Projection skip connections:
In our skeleton encoder, we replace the standard skip connection with a normalized-projection operation using EvoNorm-2D + 1 × 1 Conv + Gaussian dropout, as in Figure 4. This new normalization layer adds together two types of statistical moments, batch variance and instance variance, which together capture global and local information across images without any explicit activation function [53]. The proposed projection operation helps reduce the number of learnable weights and also allows intricate learning of cross-channel information.
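A rough NumPy sketch of this skip connection is shown below; the actual EvoNorm formulation [53] includes learnable affine terms, so treat this as a simplified illustration of the variance-sum normalization, the 1 × 1 channel projection, and the Gaussian (multiplicative) dropout:

```python
import numpy as np

rng = np.random.default_rng(0)

def evonorm_projection_skip(x, W, drop_std=0.1, eps=1e-5):
    """Sketch of the proposed skip connection: normalize by the sum of
    batch variance (global) and instance variance (local), project
    channels with a 1x1 conv (a matmul over C), then Gaussian dropout.
    x: (N, C, H, W) feature map; W: (C_in, C_out) projection weights."""
    batch_var = x.var(axis=(0, 2, 3), keepdims=True)  # per-channel, whole batch
    inst_var = x.var(axis=(2, 3), keepdims=True)      # per-channel, per-sample
    normed = x / np.sqrt(batch_var + inst_var + eps)  # no explicit activation
    projected = np.einsum('nchw,cd->ndhw', normed, W) # 1x1 convolution
    return projected * (1.0 + drop_std * rng.standard_normal(projected.shape))

x = rng.standard_normal((2, 8, 16, 16))
W = rng.standard_normal((8, 4)) / np.sqrt(8)  # projects 8 channels down to 4
print(evonorm_projection_skip(x, W).shape)  # (2, 4, 16, 16)
```

Because the 1 × 1 convolution is just a channel-mixing matrix, it carries far fewer parameters than a spatial convolution while still blending cross-channel information.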

Additional Factors:
The performance of semi-supervised models trained for image segmentation can be significantly impacted by the proper selection of regularizer, optimizer, and hyper-parameters. The model, implemented in Keras, was initialized with the He normal initializer and trained for 100 epochs with a batch size of 4. We trained all the components iteratively using the Adam optimizer with a learning rate of 0.0001 to minimize the objective function. All experiments were conducted on a machine equipped with two NVIDIA RTX 2080 Ti GPUs (each with 11 GB of memory). The detailed training procedure is presented in Algorithm 1.
Training: In our semi-supervised setup, we trained the network on varying proportions of labeled data, using 1%, 10%, 20%, 30%, 50%, and 90% as the labeled set and the rest of the data as the unlabeled training set, to maintain $|D_{\mathcal{L}}| \leq |D_{U\mathcal{L}}|$. In Section 3, we include an ablation study to investigate the importance of adding different loss components to our model CqSL. Here, we utilize the same backbones as the baselines, the only exceptions being different loss functions. To clarify, in 1 CqSL, we removed the weighted soft focal loss (WSFL) from the weighted soft background focal loss (WSBF), while keeping the background focal dice loss (BFD), mutual information minimizer loss (MIM), and adversarial-Geman-McClure (adv-GM) loss the same as before. In 2 CqSL, we removed our Geman-McClure version of the adversarial loss, while keeping the regular adversarial loss, weighted soft background focal loss (WSBF), and mutual information minimizer loss (MIM) the same as before. Similarly, in 3 CqSL, we used the DICE + CE loss rather than our novel weighted soft background focal loss (WSBF), while keeping the mutual information minimizer loss (MIM) and adversarial-Geman-McClure (adv-GM) loss the same as before. Finally, in 4 CqSL, we removed our mutual information minimizer (MIM) loss, while keeping the weighted soft background focal loss (WSBF) and adversarial-Geman-McClure (adv-GM) loss the same as before. Additionally, the sentiency block $S_e$ and the skeleton block $SK_e$ remained in place. We evaluated the performance of all four CqSL semi-supervised variants, as summarized in Tables 1-3 in the Results section; as illustrated later, the 1 CqSL variant performed best, but for the sake of consistency, we assess and compare the performance of all four implemented variants.

Evaluation Metrics
To evaluate the performance of the semantic segmentation of cardiac structures, we use the standard metrics, including Dice score, Jaccard index, Hausdorff distance (HD), precision (Prec), and recall (Rec).

Dice and Jaccard Coefficients:
The Dice score is used to measure the percentage of overlap between manually segmented and automatically segmented boundaries of the structures of interest. Given the set of all pixels in the image $\Omega$, the set of foreground pixels produced by the automated segmentation $S_a^1$, and the set of ground-truth pixels $S_g^1$, with $S_a^1, S_g^1 \subseteq \Omega$, and denoting the vector of ground truth labels $T_1$ and the vector of predicted labels $P_1$, the Dice score is

$Dice(T_1, P_1) = \frac{2 |T_1 \cap P_1|}{|T_1| + |P_1|}$ (12)

The Dice score measures the similarity between the two sets $T_1$ and $P_1$, where $|T_1|$ denotes the cardinality of the set $T_1$, and $Dice(T_1, P_1) \in [0, 1]$.
The Jaccard index, or Jaccard similarity coefficient, is another metric that aids in evaluating the overlap between two sets. This index is similar to the Dice coefficient but mathematically different and typically used for different applications. For the same sets, the Jaccard index can be written as

$Jaccard(T_1, P_1) = \frac{|T_1 \cap P_1|}{|T_1 \cup P_1|}$ (13)

Precision and Recall: Precision and recall are two other metrics used to measure segmentation quality and are sensitive to under- and over-segmentation. High values of both precision and recall indicate that the boundaries of both segmentations agree in location and level of detail:

$Precision = \frac{TP}{TP + FP}$ (14)

$Recall = \frac{TP}{TP + FN}$ (15)

where TP denotes true positives, counted when a prediction-target mask pair has an overlap score exceeding a predefined threshold; FP denotes false positives, where a predicted mask has no associated ground truth mask; and FN denotes false negatives, where a ground truth mask has no associated predicted mask.
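The four overlap metrics above reduce to simple set counts on binary masks; a NumPy sketch:

```python
import numpy as np

def overlap_metrics(gt, pred):
    """Dice, Jaccard, precision, and recall for binary masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    tp = np.sum(gt & pred)    # pixels both masks mark as foreground
    fp = np.sum(~gt & pred)   # predicted foreground absent from ground truth
    fn = np.sum(gt & ~pred)   # ground-truth foreground missed by prediction
    dice = 2 * tp / (2 * tp + fp + fn)
    jaccard = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return dice, jaccard, precision, recall

gt = np.array([[1, 1, 0], [0, 1, 0]])
pred = np.array([[1, 0, 0], [0, 1, 1]])
# tp=2, fp=1, fn=1 -> dice=2/3, jaccard=1/2, precision=2/3, recall=2/3
print(overlap_metrics(gt, pred))
```

Note that Dice and Jaccard are monotonically related (Dice = 2J / (1 + J)), which is why Dice is always the larger of the two for imperfect overlap.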

Hausdorff distance (HD):
Hausdorff distance (HD) measures the maximum distance between two surfaces. Let $S_A$ and $S_B$ be the surfaces corresponding to two binary segmentation masks, $A$ and $B$, respectively. The Hausdorff distance is defined as

$HD(A, B) = \max \left\{ \max_{p \in S_A} d(p, S_B), \max_{q \in S_B} d(q, S_A) \right\}$ (16)

where $d(p, S) = \min_{q \in S} d(p, q)$ is the minimum Euclidean distance of point $p$ from the points $q \in S$.
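Equation (16) can be computed directly from the pairwise distances between boundary points; a NumPy sketch for small point sets (a production version would use a KD-tree, e.g., SciPy's `directed_hausdorff`):

```python
import numpy as np

def hausdorff_distance(A, B):
    """Symmetric Hausdorff distance between two point sets of shape
    (n, 2) and (m, 2): max over both directed distances max_p min_q ||p - q||."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # (n, m) pairwise
    return max(d.min(axis=1).max(),   # directed HD from A to B
               d.min(axis=0).max())   # directed HD from B to A

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0], [4.0, 0.0]])
# directed A->B: max(0, 1) = 1; directed B->A: max(0, 3) = 3
print(hausdorff_distance(A, B))  # 3.0
```

Unlike Dice, HD is sensitive to a single stray boundary point, which is why it is reported alongside the overlap metrics.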

Image Quality Metrics:
PSNR: The peak signal-to-noise ratio (PSNR) is the most commonly used quality assessment metric for evaluating the reconstruction quality of lossy image compression codecs. The signal is the original data, and the noise is the error introduced by the distortion.
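PSNR is computed from the mean squared reconstruction error against the peak intensity value; a NumPy sketch for images normalized to [0, 1]:

```python
import numpy as np

def psnr(reference, reconstruction, max_val=1.0):
    """PSNR in dB: 10 * log10(MAX^2 / MSE) of the reconstruction error."""
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return np.inf  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.random.default_rng(0).random((64, 64))
noisy = np.clip(ref + 0.01 * np.random.default_rng(1).standard_normal(ref.shape), 0, 1)
print(psnr(ref, ref))    # inf
print(psnr(ref, noisy))  # roughly 40 dB for noise sigma = 0.01
```

Because PSNR is a log-scale measure, a 6 dB increase corresponds to roughly halving the root-mean-square error.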

Clinical Indices:
To assess the performance of the ventricles, different indices have been used in the literature [54], such as left ventricular volume (LVV), left ventricular myocardial mass (LVM), stroke volume (SV), and ejection fraction (EF). The left ventricular volume (LVV) is defined as the volume enclosed by the LV blood pool, and the myocardial mass is equal to the volume of the myocardium multiplied by the density of the myocardium:

$\text{Myo-Mass} = \text{Myo-Volume} \, (\text{cm}^3) \times 1.06 \, \text{g/cm}^3$ (17)

Stroke volume (SV) is defined as the volume ejected during systole and is equal to the difference between the end-diastolic volume (EDV) and the end-systolic volume (ESV):

$SV = EDV - ESV$ (18)

The ejection fraction (EF) is an important cardiac parameter quantifying cardiac output, defined as the ratio of the SV to the EDV:

$EF = \frac{SV}{EDV} \times 100\%$ (19)

Results

Image Segmentation Assessment
We tested our CqSL model on varying proportions of labeled and unlabeled data available through the STACOM 2017 ACDC cine cardiac MRI dataset. Training and validation segmentation accuracies for three different classes (RV, LV, and LV-Myo) are shown in Figure 8 for 100 epochs. Note that the validation curves show similar trends as the training curves ( Figure 8).
The CqSL experimental results were compared against a fully supervised U-Net model trained from scratch, as reported in Tables 1-3 and Table 4. The segmentation performance is evaluated both qualitatively and quantitatively. As shown in Tables 1-3, our proposed model significantly improves the segmentation performance for the right ventricle (RV), left ventricle blood-pool (LV), and LV-Myocardium, respectively, on varying proportions of annotated data in terms of the Dice and Jaccard indices, Hausdorff distance, and precision and recall rates. Our CqSL model achieves a high Dice score (± std. dev.) of 75.50 ± 10.9% for the RV, 83.21 ± 7.1% for the LV blood-pool, and 77.65 ± 9.3% for the LV-Myocardium, even when using only 1% labeled data. Figure 9 illustrates a qualitative segmentation output comparing CqSL and two other semi-supervised models, i.e., model I: a GAN-only architecture (Figure 3c); model II: I + reconstruction (Figure 3c,d). For simplicity, this comparison is based on 20% labeled training data. As demonstrated, when only 20% of the training annotations are employed, U-Net fails completely to segment the cardiac structures from base to apex, particularly the RV. As shown in the figure, the segmentation results improve with each consecutive addition of a distinct block. The GAN-only architecture performs poorly, particularly for RV segmentation, whereas the addition of a reconstruction block improves performance. Finally, adding a disentangled block to the GAN and reconstruction blocks yielded the best results. Even the least performing variant of our proposed CqSL model, 4 CqSL, achieves an overall accuracy superior to the U-Net, GAN-only, and GAN+REC models, confirming that the proposed model is able to effectively learn the correct features that ensure correct segmentation. Figure 10 illustrates a qualitative segmentation output comparing CqSL and U-Net results with an increasing proportion of unlabeled training data.
For simplicity, we have shown two of our best performing models. As shown, when only 1% training annotation is used, U-Net completely fails to segment the cardiac structures. Under similar conditions, our model is still able to yield a high segmentation accuracy of LV, RV, and LV-Myocardium. When the amount of labeled data increases from 1% to 10%, the U-Net model still performs poorly, especially for RV segmentation. On the other hand, although the performance of our model improves significantly when utilizing more than 30% annotated data, its performance with even 1% labeled data is still satisfactory, comparable to that of semi-supervised models, and superior to U-Net's performance under similar conditions.
We assessed the performance of our proposed CqSL cardiac image segmentation method against the segmentation results yielded by the well-established, fully supervised U-Net architecture [55] in light of its effectiveness across various medical image segmentation applications, as well as its extensive use as a baseline method for comparison by the participants of the ACDC cardiac image segmentation challenge. Furthermore, to explore the effectiveness of each component in our model, we experiment on three different semi-supervised ablations, i.e., model I: only a GAN architecture; model II: GAN + reconstruction; and model III: GAN + reconstruction + disentangler block (CqSL).
As shown in Figure 11, the accuracy of our CqSL models remains high when using as much as 50-90% unlabeled data, which essentially implies excellent performance with as little as 10% annotated data. Nevertheless, both the U-Net and CqSL models perform similarly to each other when the amount of annotated data increases above 90%. We plot the mean accuracy for all the models in Figure 12 and confirm that under low amounts of annotated data, even as low as 1%, our proposed CqSL model and all four of its semi-supervised variants (1CqSL, 2CqSL, 3CqSL, and 4CqSL) outperform the GAN, GAN+REC, and U-Net models for the LV, RV, and LV-Myocardium.

The typical segmentation contours of the complete cardiac image dataset for the mid and apical slices are shown in Figure 13. Figure 14 illustrates a qualitative comparison between the original image slice and the reconstructed slices generated from our proposed approach on the ACDC dataset at the original 5 mm slice thickness. The comparison is augmented by the computed correlation coefficients (CC) and peak signal-to-noise ratio (PSNR) shown below each figure. As illustrated in Figure 14, our approach preserves the fine structural details and realistic textures while remaining visually comparable to the ground truth image. Aside from the qualitative improvements, the proposed method's CC and PSNR values also confirm that the synthesized image slices preserve the fine structural details. Table 5 shows the quantitative results of the objective quality metrics of reconstruction, indicating that the use of feature-wise linear modulation to remove domain-invariant information from the disentangled latent code guides the synthesis of more texture information. Starting with the spatial factor, we change the content of the spatial channels in Figure 15 to see how the decoder has learned a correlation between the position of each channel and the different signal intensities of the skeleton parts.
The sentiency factor remains constant in all of these experiments. The first two columns show the original input and the reconstruction. The third column is created using only the RV spatial channels while disregarding (zeroing) the MYO and LV channels. In the fourth column, we swap the RV channels with those of the LV. Finally, the fifth column is produced by considering all LV, MYO, and RV channels.
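The CC and PSNR metrics used above to assess reconstruction quality can be sketched as follows. This is a minimal NumPy illustration; the `data_range` parameter reflects an assumption that the images are intensity-normalized to [0, 1]:

```python
import numpy as np

def psnr(ref: np.ndarray, rec: np.ndarray, data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio (dB) between a reference image and a reconstruction."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(data_range ** 2 / mse))

def correlation_coefficient(ref: np.ndarray, rec: np.ndarray) -> float:
    """Pearson correlation coefficient between the flattened image intensities."""
    return float(np.corrcoef(ref.ravel(), rec.ravel())[0, 1])
```

Higher PSNR indicates a lower pixel-wise reconstruction error, while a CC close to 1 indicates that the reconstruction preserves the intensity structure of the original slice.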

Clinical Parameter Estimation
The performance of our developed segmentation method was also reflected in the computed clinical indices. These clinical indices are computed using Simpson's method, and the agreement between the ground truth and the same parameters computed using the automated segmentation results is reported using correlation statistical analysis by mapping the predicted volumes of the testing set onto the ground truth volumes of the training set. As illustrated in Table 6, the agreement between our method's predictions and the ground truth is high, characterized by a Pearson's correlation coefficient (ρ) of 0.898 (p < 0.01) for LV-EF, 0.723 (p < 0.1) for RV-EF, and 0.924 (p < 0.01) for Myo-mass. There was a slight over-estimation in the RV blood-pool segmentation, which was also reflected in the clinical parameter estimation. Figure 16 shows a graphical comparison between the clinical parameters estimated from the cardiac features segmented via CqSL and the same homologous parameters estimated from the ground truth manual segmentations for both healthy volunteers and patients featuring various cardiac conditions. As shown, the clinical parameters estimated using our automatically segmented features show no statistically significant difference from those estimated based on the ground truth, manually segmented features.
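For reference, the clinical indices follow from per-phase volumes: Simpson's method sums the segmented slice areas multiplied by the slice thickness, and the ejection fraction is the relative volume change between end-diastole and end-systole. A minimal sketch under those definitions (the function names, units, and parameters are illustrative, not the paper's implementation):

```python
import numpy as np

def simpson_volume(masks: np.ndarray, pixel_area_mm2: float,
                   slice_thickness_mm: float) -> float:
    """Ventricular volume (mL) via Simpson's method.
    `masks` is a (slices, H, W) binary stack for one cardiac phase;
    each slice contributes area x thickness to the total volume."""
    slice_areas = masks.reshape(masks.shape[0], -1).sum(axis=1) * pixel_area_mm2
    return float(slice_areas.sum() * slice_thickness_mm / 1000.0)  # mm^3 -> mL

def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Ejection fraction (%) from end-diastolic and end-systolic volumes."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml
```

The EF and myocardial mass estimates compared in Table 6 and Figure 16 are derived from such per-phase volumes computed on the predicted and ground truth masks, respectively.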

Ablation Studies
We perform an ablation study to investigate the effect of using different loss functions in our semi-supervised setting. We demonstrate the effect of the novel loss functions used in the CqSL model (WSBF, MIM, and adv-GM) by assessing the model performance when each loss function is removed. Figure 17 shows a graphical representation of the results achieved on the ACDC dataset. In Figure 10, we illustrate the qualitative results on the ACDC dataset to visualize the effect of using all of the loss components. We observe that the best results are achieved when all of the loss components are used. Specifically, without MIM, the loss curve oscillates, while without WSBF, the output images deviate drastically from the ground truth. Both the quantitative and qualitative results show that the design of CqSL improves the preservation of the subject identity and enables more accurate segmentation of cardiac structures.

Conclusions and Future Work
In this paper, we propose a semi-supervised learning model (CqSL) that features multiple novel loss functions, including mutual information minimization (MIM), which minimizes the mutual information between the domain-invariant and domain-specific features. Empirically, we show that disentanglement with mutual information minimization can improve segmentation accuracy when combined with an adversarial and a reconstruction block. Our total loss function forces the network to capture both the spatial and intensity information. Our weighted soft focal loss mitigates the class imbalance problem by applying varying weights over the different classes along with a modulating term. We apply the proposed model to cardiac image segmentation tasks with varying proportions of labeled data.
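The exact WSBF formulation is defined in the paper's methods; purely to illustrate the generic mechanism of a class-weighted focal term (per-class weights to counter imbalance, combined with a modulating factor that down-weights easy pixels), a hypothetical NumPy sketch might look like:

```python
import numpy as np

def weighted_focal_loss(probs: np.ndarray, onehot: np.ndarray,
                        class_weights: np.ndarray, gamma: float = 2.0,
                        eps: float = 1e-7) -> float:
    """Illustrative class-weighted focal loss (not the paper's exact WSBF).
    probs:   (N, C) softmax outputs per pixel
    onehot:  (N, C) one-hot ground truth labels
    class_weights: (C,) weights countering class imbalance"""
    p_t = np.clip((probs * onehot).sum(axis=1), eps, 1.0)  # prob of the true class
    w_t = (class_weights * onehot).sum(axis=1)             # weight of the true class
    modulating = (1.0 - p_t) ** gamma                      # down-weights easy pixels
    return float(np.mean(-w_t * modulating * np.log(p_t)))
```

With gamma = 0 this reduces to a weighted cross-entropy; increasing gamma shrinks the contribution of well-classified pixels, focusing training on hard examples such as the thin myocardium.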
Our proposed CqSL model achieves 85% accuracy, significantly outperforming the other baselines. We incrementally add each component to study its effectiveness on the final results: model I: a GAN-only architecture (Figure 3c); model II: GAN + reconstruction (Figure 3c,d); model III: GAN + reconstruction + disentangled block (Figure 3a-d).
For consistency, all four implemented CqSL variants are evaluated and compared to the baselines; as shown in Tables 1-3, the first variant, 1CqSL, performs best and is hence deemed the most suitable and recommended CqSL framework.
The experimental results reported in this manuscript show that the proposed CqSL framework outperforms semi-supervised learning with GANs [56] as well as fully supervised models when using as little as 1% labeled data, and displays comparable accuracy when employing more than 50% labeled data. Unlike these methods, we use an adversarial Geman-McClure (adv-GM) loss to force the generated masks to be spatially aligned with the image. Furthermore, we find that the semi-supervised segmentation approach of Hung et al. [18] obtains results slightly inferior to ours: Hung et al. reported that their adversarial model achieved an 80.63% accuracy when trained on 20% labeled data from the ACDC dataset, whereas our model achieved an 81.44% accuracy under similar training conditions.
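The Geman-McClure penalty underlying the adv-GM loss mentioned above is a robust kernel that behaves like a squared error for small residuals but saturates for outliers, limiting their influence on training. A minimal sketch of the penalty itself (the scale parameter `sigma` is illustrative, not a value from the paper):

```python
import numpy as np

def geman_mcclure(residual, sigma: float = 1.0):
    """Geman-McClure robust penalty: approximately quadratic near zero,
    saturating toward 1 for large residuals so outliers are down-weighted."""
    r2 = np.square(residual)
    return r2 / (r2 + sigma ** 2)
```

This saturation is what makes the adversarial objective less sensitive to occasional badly misaligned mask predictions than a plain squared-error penalty.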
Hence, the proposed method is the first to achieve significant performance for 4D cine cardiac MRI segmentation with very minimal annotated data, specifically 1% of the training dataset. This is a key feature of the proposed work and hence a significant contribution to medical (and, in particular, cardiac) image segmentation, as access to large amounts of expert-annotated ground truth imaging data is expensive in the medical field. Nevertheless, we demonstrate that CqSL can still yield segmentation accuracy superior to other semi-supervised methods while requiring minimal annotated data for training.

Funding:
Research reported in this publication was supported by the National Institute of General Medical Sciences Award No. R35GM128877 of the National Institutes of Health, and the Office of Advanced Cyberinfrastructure Award No. 1808530 of the National Science Foundation.

Figure 1.
Images, histograms and surface plots of two 3D cardiac images featuring all slices of two random patients from the ACDC dataset are illustrated in (a,b). From left to right: cardiac MR image in 4 dimensions, histogram plot, and surface plot.

Illustration of the CqSL framework: Our model makes use of both labeled and unlabeled images. The first block (a) crops the input images to a specific dimension. Then, we disentangle the latent features of the images via a disentangled block. An input image is first encoded to a multi-channel spatial representation, SK_d^n, n = 1, 2, …, 8. Then, SK_d^n can be fed into a segmentation network S_I to generate a multi-class segmentation mask. (c) We train a generative network, which predicts semantic labels for both labeled and unlabeled data. (b) A sentiency encoder S_e uses the factor SK_d^n and the input image to generate a latent vector z representing the imaging modality using a variational autoencoding block. (d) The decoder networks combine the two representations SK_d^n and z to reconstruct the input image.

Representative accuracy curves showing the training and validation accuracy of three different classes (RV blood-pool, LV blood-pool, and LV-Myocardium).

Representative results showing the comparison across several best performing networks, including CqSL, for the semantic segmentation of the full cardiac image dataset from base to apex, showing the RV blood-pool, LV blood-pool, and LV-Myocardium on 20% labeled data in red, green, and yellow, respectively.

Evaluation of the robustness of CqSL in terms of mean accuracy over the RV, LV, and LV-Myocardium segmentation tasks on varying amounts of labeled training samples. Note the significant improvement in Dice score across all CqSL semi-supervised variants for as little as 1% labeled data.
Hasan and Linte Page 33

Figure 13.
Representative segmentation contours of a complete cardiac cycle for the middle and apex slices showing the RV blood-pool, LV blood-pool, and LV-Myocardium in green, yellow, and brown, respectively, in three different view settings (axial, sagittal, and coronal).

Figure 14.
Qualitative comparison of the original and the reconstructed slices showing that the original images are well reconstructed by combining skeleton and sentiency information. The comparison is augmented by the computed correlation coefficients (CC) and peak signal-to-noise ratio (PSNR). The middle row illustrates the error images.
Figure 15.
Reconstructions of a sample of input images when rearranging the spatial representation's channels. Rearranging the channels results in reconstructing only the left ventricle blood-pool, only the right ventricle blood-pool, or all the ventricular structures.
Figure 16.
Graphical comparison showing no statistically significant differences between the clinical parameters estimated using the CqSL segmentation and the same parameters estimated using the ground truth segmentation in terms of Mean (Std. Dev.); EF (mL/mL (%)) = ejection fraction, Myo-mass (in gm) = myocardial mass. The correlation between the CqSL-predicted and ground truth clinical indices is significantly higher than the correlation between the U-Net-predicted and the same ground truth clinical indices (** (p < 0.01), * (p < 0.1)).

Figure 17.
Empirical analysis showing the effect of different loss functions on the 2017 STACOM ACDC dataset. The significant reduction of the total loss in CqSL (in red) suggests the best performing model with the best learned features.