Whole Heart Segmentation Using 3D FM-Pre-ResNet Encoder–Decoder Based Architecture with Variational Autoencoder Regularization

: An accurate whole heart segmentation (WHS) on medical images, including computed tomography (CT) and magnetic resonance (MR) images, plays a crucial role in many clinical applications, such as cardiovascular disease diagnosis, pre-surgical planning, and intraoperative treatment. Manual whole-heart segmentation is a time-consuming process, prone to subjectivity and error. Therefore, there is a need to develop a quick, automatic, and accurate whole heart segmentation systems. Nowadays, convolutional neural networks (CNNs) emerged as a robust approach for medical image segmentation. In this paper, we ﬁrst introduce a novel connectivity structure of residual unit that we refer to as a feature merge residual unit (FM-Pre-ResNet). The proposed connectivity allows the creation of distinctly deep models without an increase in the number of parameters compared to the pre-activation residual units. Second, we propose a three-dimensional (3D) encoder–decoder based architecture that successfully incorporates FM-Pre-ResNet units and variational autoencoder (VAE). In an encoding stage, FM-Pre-ResNet units are used for learning a low-dimensional representation of the input. After that, the variational autoencoder (VAE) reconstructs the input image from the low-dimensional latent space to provide a strong regularization of all model weights, simultaneously preventing overﬁtting on the training data. Finally, the decoding stage creates the ﬁnal whole heart segmentation. We evaluate our method on the 40 test subjects of the MICCAI Multi-Modality Whole Heart Segmentation (MM-WHS) Challenge. The average dice values of whole heart segmentation are 90.39 % (CT images) and 89.50 % (MRI images), which are both highly comparable to the state-of-the-art.


Introduction
Functional irregularities of the heart and blood circulatory system are referred to as cardiovascular diseases (CVDs). CVDs cause significant degeneration of patients' life quality, while severe cases result in death. Research statistics provided by the World Health Organization show that CVDs account for 17.9 million deaths per year, which makes them the leading cause of death globally [1]. Early diagnosis of CVDs enables timely and appropriate treatment and prevention of patients' death. The diagnostic process includes obtaining images of unhealthy or weakened heart structures using imaging devices such as echocardiography, computed tomography (CT), or magnetic resonance (MRI). After that, collected images are observed, interpreted, and analyzed by clinical experts using specialized medical software built with advanced image processing methods.
Various clinical applications require insights into different cardiovascular system structures. For example, whole heart segmentation is crucial for pathology localization and hearts' anatomy and function analysis. The construction of patient-specific threedimensional (3D) heart models and surgical implants greatly benefits pre-surgical planning for patients with atherosclerosis, congenital heart defects, cardiomyopathy, or even inspecting different heart infections in post-surgical treatment [2]. Whole heart segmentation includes delineation of four heart chambers, the entire blood pool, and the great vessels, which makes manual segmentation by clinical experts time-consuming and prone to observer variability. There is an increasing focus on developing accurate and robust automatic image processing methods. The development of accurate, efficient, and automatic image processing and analysis methods is a complex task, especially in the medical field. The main reason is in high variability in image intensities distribution and dynamic properties of cardiac structures. Nevertheless, the advancements and rapid development in image processing, computer vision, and artificial intelligence fields significantly facilitate this challenging task.
A commonly used approach for medical image segmentation includes encoderdecoder-based architectures such as the U-Net architecture [3]. The U-Net architecture and its' corresponding three-dimensional counterpart, 3D U-Net [4], consist of contracting and expanding pathways. Throughout the contracting pathway, the network learns low-level features and reduces its numbers using sets of pooling and convolutional layers. In an expanding pathway, the network learns high-level features and recovers the original image resolution using deconvolutional layers. Features from contracting and expanding pathways are concatenated with skip connections to retrieve lost image information that occurs during the down-sampling process. Intuitively, this indicates that the part of the information is lost during the encoding process and can not be recovered when decoding. Variational autoencoders [5] enable regularization during the training to ensure that the latent space, i.e., encoded space, keeps the maximum of information when encoding, which results in the minimum reconstruction error during the decoding. Furthermore, since the number of features in the contracting pathway is significantly lower than the number in the expanding pathway, direct concatenation of these features may not produce the most optimal results. The increment in the number of layers provides larger parameter space enabling learning of more abstract features. Therefore, deeper architectures could provide more abstract learning that results in better performance and higher accuracy in medical segmentation tasks. Common obstacles in training deeper neural network architectures are the appearance of vanishing gradients, accuracy degradations, and extensive parameter growth that lead to computationally expensive models. Recent advancements have shown that convolutional networks can be significantly deeper and still preserve high efficiency and accuracy if they contain shorter and direct connections in between each layer. The introduction of skip connections in ResNets [6] allows for copying of the activations from layer to layer. Since some features are best constructed in shallow networks and others require more depth, skip connections increase the network's capability, flexibility, and performance.

Research Contributions
Motivated by previously described advancements, in this paper, we present a novel 3D encoder-decoder based architecture with variational autoencoder regularization. Our intention is to achieve maximum optimality in training performance, efficiency, and final segmentation result accuracy for the whole heart segmentation task. The work contributions can be summarized as:

1.
We propose a new connectivity structure of residual unit that we refer to as feature merge residual units (FM-Pre-ResNet). The FM-Pre-ResNet unit attaches two convolution layers at the top and at the bottom of the pre-activation residual block. The top layer balances the parameters of the two branches, while the bottom layer reduces the channel dimension. This allows the construction of a deeper model with a similar number of parameters compared to the original pre-activation residual unit.

2.
We present a 3D encoder-decoder based architecture that efficiently incorporates FM-Pre-ResNet units and is additionally guided with variational autoencoders (VAE). The architecture includes three stages. First, in an encoding stage, FM-Pre-ResNet units learn a low-dimensional representation of the input. Second, in the VAE stage, an input image is reduced to a low-dimensional latent space and reconstructs itself to provide a strong regularization of all model weights and is only used during the network training. Finally, the decoding stage creates the final whole heart segmentation. 3.
The proposed approach obtains highly comparable dice scores to the state-of-the-art for whole heart segmentation tasks on CT images while outperforming the current state-of-the-art on the MRI images.

Paper Overview
The remainder of the paper is structured as follows. In Section 2, we give an overview of related research. First, we briefly review previous research regarding whole heart segmentation. After that, we provide a background of the most successful ResNet variants and feature reuse mechanisms. Section 3 gives details about our proposed method. First, we introduce an overall architecture design. After that, we present the encoder and decoder stages as well as a theoretical explanation of the new connectivity structure of the residual unit. We introduce the variational autoencoders and their role in the proposed architecture as well. Section 4 provides dataset description, implementation details, and presents conducted experiments and obtained results for the whole heart segmentation. Finally, the concluding remarks are given in Section 5.

Related Concepts
In this section, we review some related works. First, we briefly review the prior methods in the whole heart segmentation tasks, focusing on CNN-based approaches. After that, we present significant residual network variants and feature reuse mechanisms relevant to our research.

Previous Methods for Whole Heart Segmentation
The field of whole heart segmentation is a frequently researched area due to its extremely high importance in clinical practice. Various segmentation algorithms and methods have been proposed over the years. Detail reviews of previously published methodologies can be found in [7][8][9][10]. Prior methods are fundamentally classified into traditional segmentation methods (active contours, deformable models, registration, atlas-based frameworks) and CNN-based segmentation methods. For example, Zhuang et al. [11] propose a new global atlas ranking scheme where both the global and local atlases use the theoretical measures information for computing similarity between the atlases and target image. Galisot et al. [12] propose a method that combines topological graphs and local probabilistic atlases for learning priors. At the same time, the hidden Markov field (MF) integrates the learned information and provides final pixel classification. A significant limitation of registration-based methods is a requirement for high similarity in volume orientation and acquisition protocols between target and atlas images. Recent advancements introduce a simple linear iterative clustering (SLIC) super voxel method for prevention of misregistration by detecting a bounding box region enclosing the heart, after which heart segmentation is performed subsequently [13]. Although registration and atlas-based frameworks usually have high accuracy and precision, algorithms are still not robust enough, which often leads to unsatisfactory results when the data quality is poor.
Recently, CNN-based methods have shown superior performance in the field of medical image segmentation and analysis. For example, Payer et al. [14] use two separate CNNs: first to localize heart and second to segment the fine details of the heart structures within a small region of interest (ROI). The localization network predicts the approximate center of ROI using heatmap regression [15] and the U-Net. Final pixel predictions are acquired using the multi-label segmentation CNN. Wang et al. [16] combine statistical shape priors with CNNs to extract 3D shape knowledge using the shape context method. They detect ROI using a random forest landmark detection, after which they generate a probability map using multi-view slices and three 2D U-Nets. Finally, they apply a hierarchical shape prior algorithm [17] on the probability map to estimate the shape of each heart structure.
Sundgaard et al. [18] use 2D CNNs with a multi-planar method to investigate the power of retaining spatial information across slices, as is the case of 3D networks. Mortazi et al. [19] present a multi-planar CNN method using an encoder-decoder architecture. After that, they apply an adaptive fusion [20] to obtain refined segmentation. This method requires less memory in comparison to the 3D counterpart. Liao et al. [21] propose a multi-modality transfer learning method that combines spatial attention mechanism for retaining and removing useful and redundant information, respectively. Furthermore, Dou et al. [22] apply deep supervision during network training to obtain faster convergence and higher distinguishing ability, while Tong et al. [23] combine ROI detection and 3D deeply-supervised U-Net to reduce the computational complexity. Although convolutional neural networks perform well, the limited amount of annotated data requires the development of efficient and more complex, deeper network architectures.

ResNet Variants and Feature Reuse Networks
Deep convolutional neural networks have shown a significant increase in the accuracy for various segmentation and classification tasks. However, a common obstacle in training deep neural network architectures is the appearance of vanishing or exploding gradients. As the depth of CNN increases, information about the gradient passes through many layers, and it can vanish or accumulate large errors by the time it reaches the end of the network. This problem has been largely addressed using activation functions with a small derivate such as rectified linear unit (ReLU), implementation of gradient clipping, intermediate normalization layers, or careful weight initialization. Nevertheless, with the increasing network depth, accuracy gets saturated and then degrades rapidly. The introduction of skip connections in ResNets [6] allows for copying of the activations from layer to layer, thus preserving information and significantly increasing the performance. Nevertheless, when the depth of the network goes very deep, ResNets become challenging to converge. These difficulties were addressed in Pre-ResNets [24] by introducing forward and backward signals that directly propagate from one block to any other using identity mappings after-addition activation and as the skip-connections. This ultimately constructs a new residual block with the BN-ReLU-Conv order. Zagoruyko [25] introduces level-wise shortcut connections to alleviate the learning capability and significantly boost network performance. Moreover, the deep network initialization problem and incompatibility between ReLU and element-wise summation were addressed in weighted residual networks (WNR) [26]. Although deeper residual networks showed performance improvement, diminishing feature reuse slows down network training. This was addressed by increasing and decreasing the width and depth, respectively, in improved WNRs [27].
Furthermore, another efficient way to alleviate network performance is by reusing features. DenseNet [28] introduces connections between all successive layers in a feedforward manner where features from each preceding layer are used as inputs to every other layer. This means that each layer is receiving cumulative knowledge from all prior layers, i.e., it reuses features. A variety of compelling benefits are obtained with the introduction of direct connections between layers. First, it allows more depth of the network while simultaneously alleviating the vanishing and exploding gradient problems. Second, the use of features from all layers leads to improvements in the performance. Finally, it efficiently utilizes parameters. This allows for less propensity to overfitting and leads to a reduction of computational costs. CondenseNet [29] combines dense connectivity with a group convolution to further facilitate feature reuse through the network. Here, the group convolutions aim at removing direct connections between layers allowing distinctly smooth feature reuse.

Methodology
In this section, we present a theoretical overview of the proposed encoder-decoder based architecture. First, we give an overall architecture design and introduce the main building blocks and their purpose. After that, we give a theoretical background of the main components of our method: feature merge residual units (FM-Pre-ResNet) and variational autoencoder. Finally, we describe the used loss function.

Architecture Overview
An image segmentation task can be written as mapping: where i ∈ I denote input images, while o ∈ O denote their corresponding segmentations.
For an encoder-decoder based architecture, the same mapping function can be written as: where E Ω , D ∆ are an encoder and the decoder networks parametrized by Ω and ∆, respectively. Introduction of shared VAE, expressed with V Λ (·), at the encoders' endpoint allows mapping of input images to a lower-dimensional latent, i.e., encoded, space. The output of an encoder E Ω contains the samples from the latent space, which we in detail discuss in Section 3.3. Therefore, our proposed architecture consists of three main stages: (1) encoding stage, (2) reconstruction of the input with variational autoencoder, and (3) decoding stage. An encoding stage incorporates feature merge residual units by which the network learns a low-dimensional representation of the input. The variational autoencoder reconstructs the input image from low-dimensional latent space to regularize all model weights and adds additional guidance to the encoding stage. Finally, in the decoding stage, the network learns high-level features and creates the final segmentations. An illustration of the proposed architecture is shown in Figure 1. Input is a volumetric CT or MRI image. Each red block is the FM-Pre-ResNet block. The VAE branch is added at encoders' output and is used only during training. The decoder stage creates the final whole heart segmentation.

Encoding Stage
The ResNets contains multiple stacked residual units. Generally, each residual unit can be expressed with the following two formulations: and where F is residual function, x l and x l+1 denote the input and output of the l − th residual unit in the network, while the output of the l − th residual unit is denoted with y l . The parameters of the l − th residual unit are denoted as W l , while the function f refers to the rectified linear unit (ReLU). The identity mapping, by which ResNets learn residual function F in regard to H(x l ), can be written as: Therefore, the identity mapping of original residual block attaches an identity skip connection allowing information flow within a residual unit as shown in Figure 2a.
As introduced in Pre-ResNets (Figure 2b), if H(x l ) and f are both an identity mapping, the direct propagation of information through the entire network in forward and backward fashion can be written as: Following the concept described in Equation (6), we propose a novel feature merge residual unit that can be written as follows: and where • presents the concatenation operation, Z(x l ) denotes the concatenated result, while the functions z and g denote the convolution layers, added at the top and at the bottom of the residual unit, respectively. In this manner, the top convolution layers' output is concatenated with the residual signals' output, which allows the merge of features from preceding layers. After that, the concatenated result is passed through a bottom convolution layer to reduce channel dimension, as shown in Figure 2c.

Variational Autoencoder Stage
Variational autoencoders are able to capture latent representations, which makes them ideal for use in generative settings [5,30]. The evidence lower bound (ELBO), which is VAEs optimization objective, can be written as: where the term L REC (i,î) denotes reconstruction loss, and can be further written as: whereî denotes the reconstructed input.
The term K, L[q Λ (r|i)||p(r)] from Equation (9) defines the KL divergence of the approximating variational density, which can be expressed as: The standard prior on the latent variable can be written as: where the aligned Gaussian (µ Λ , σ 2 Λ ) is expressed by the encoder network V Λ (·). Following this, the low dimensional representations of the input data i can be obtained by introducing the latent random variable r. The input images are mapped to a low dimensional space using VAE encoder V Λ (·). After that, the output of the encoder of the segmentation network, E Ω (·), takes samples from the latent space as shown in Figure 1. In this manner, segmentation encoder and VAE jointly share the decoder D ∆ (·), which can be also written as: The final segmentation,ô, is obtained from the decoder using following expression: which can be written as: where • denotes concatenation, H = E Ω (i) is the output of the segmentation encoder and r ∼ q Λ (r|i) is a sample from the latent space that is learnt by VAE.

Loss Function
The loss function plays an important role in improving the models' performance. In this work, we employ total loss function that is the addition of soft dice loss, L2 loss and standard VAE penalty term, and can be written as: The term that represents the soft dice loss, L dice [31], can be written as: where denotes a small constant used for computational stability, i.e., to avoid zero division.
The loss L2 represents loss on the VAE encoder output and can be written as: The standard VAE penalty term, L KL , represents KL divergence between a prior distribution N(0, 1) and the estimated normal distribution N(µ, σ 2 ), which can be written as: where N represents the entire set of image voxels. Finally, the hyperparameter weight of 0.1 is empirically found to provide a good balance between VAE loss term and soft dice loss in Equation (16).

Decoding Stage
The decoder building blocks highly follows concepts described in Section 3.2, i.e., it consists of FM-Pre-ResNet units. Every FM-Pre-ResNet unit in decoder doubles the spatial dimension while reduces the feature numbers by a factor of 2. Each decoder level is concatenated with the corresponding encoder output. The final layer of the decoder provides whole heart segmentation and has the same number of features and spatial size as the original input image.

Implementation Details
In this section, we give a dataset description on which we conducted our experiments. After that, we give details about network training and implementation. Finally, we evaluate the proposed method using Multi-Modality Whole Heart Segmentation Challenge (MM-WHS) dataset and present conducted experiments and results. We investigate and compare our results to the state-of-the-art research.

Dataset Description
The network presented in the previous section was applied to the whole heart segmentation task and is evaluated on a dataset provided by Multi-Modality Whole Heart Segmentation Challenge (MM-WHS) organized by MICCAI [32]. The MM-WHS dataset consists of 60 cardiac CT/CTA and 60 cardiac MRI volumes, whereas 20 volumetric images include corresponding ground truths, manually labeled by two clinical experts. In contrast, the remaining 40 volumetric images are used for testing purposes. Ground truths of the testing dataset are provided in encrypted form and can be decoded to evaluate algorithms using the procedure described in [32]. The data were collected on patients from the everyday clinical environment, and it has a various quality to preserve the robustness of the developed algorithms when it comes to real clinical usage. The cardiac CT/CTA data are obtained using 64 slice CT scanners using a standard coronary CT angiography protocol, and cardiac MRI data were acquired using a navigator-gated 3D balanced steady-state free precession (b-SSFP) sequence for free-breathing whole heart imaging. The axial slices' in-plane resolution is 0.78 × 0.78 mm, and the average slice thickness is 1.60 mm. The data include the following seven structures of the heart: (1) the left ventricle blood cavity (LV), (2) the right ventricle blood cavity (RV), (3) the left atrium blood cavity (LA), (4) the right atrium blood cavity (RA), (5) the myocardium of the left ventricle (Myo), (6) the ascending aorta (Ao), which is defined as the aortic trunk from the aortic valve to the superior level of the atria, and (7) the pulmonary artery (PA). An example of one slice from the used dataset is shown in Figure 3.

Training and Implementation Details
To alleviate the irregularities of variable contrast in some MRI images, we normalize all input images (both CT and MRI) to have zero mean and unit std. The volumes were cropped and zero-padded to a fixed size of 176 × 224 × 144 to provide a fine ROI for the network input while making sure all heart structures are inside the selected ROI. We apply three different data augmentation methods on input image channels to increase the sample size of training data and enhance the robustness and generalization ability, namely random axis mirror flip, random scaling, and intensity shift. Random axis mirror flip creates a mirror reflection of an original image along one (or more) selected axis and is commonly flipped at a rate of 50%. Random scaling operation S scales input image and performs independently in different directions. Intensity shift performs an element-wise addition of a scalar to the image and affects the brightness of the original image. Details about parameters of used data augmentation methods are presented in Table 1 while examples of input images after applying different data augmentation methods are shown in Figure 4. Moreover, we empirically found that advanced augmentation techniques, such as random histogram matching, or random image filtering, do not show any additional improvements to the final segmentation result.  While optimization methods such as stochastic gradient descent (SGD) or gradient descent with momentum provide a powerful optimization for training CNNs, choosing the most optimal learning rate α is not a trivial task. If α is chosen to be too large, the training may not converge, might oscillate, or skip over relevant local minima. On the other hand, if it is chosen to be too small, it significantly delays the convergence process. Therefore, in this work, we use adaptive learning rate optimizer Adam with initial learning rate of α 0 = 10 −4 and gradually decrease it according to the following expression: where T c is a total number of epochs (200 in our case) and c is an epoch counter. Furthermore, to ensure our models generalizes well on unseen data, i.e., to reduce the effect of overfitting or underfitting, we employ L2 norm regularization with a weight of 10 −5 and the spatial dropout with a rate of 0.2 after the initial encoder convolution. Since early stoping aims to regularize the finding the network parameters at the point of the lowest validation loss, we implement early stopping with patience set to 50.
In our experiments, we train four encoder-decoder based architectures: (1) 3D Pre-ResNet without VAE regularization, (2) 3D Pre-ResNet with VAE regularization, (3) FM-Pre-ResNet without VAE regularization, and (4) FM-Pre-ResNet with VAE regularization. All four networks are trained from scratch and separately for CT and MRI images. The whole experimental procedure is implemented in Pytorch and trained on two NVIDIA Titan V GPU simultaneously. The source code of our method is available at [33]. Our training and validation dataset consists of 20 CT volumes and 20 MRI volumes, with 80-20% training and validation split, respectively. The trained networks are evaluated using a testing dataset that includes 40 subjects for both CT and MRI images. The proposed 3D FM-PreResNet + VAE architecture produced the highest validation accuracy of 94.40% at the end of the 200th epoch. The network is trained for 200 epochs since further training appears not to decrease validation loss as shown in Figure 5. Moreover, Figure 5 shows that, with an increase in epochs, the loss value decreases, and the accuracy increases. This is a clear indication that the network is successfully learning from the input data.

Evaluation Metrics
To evaluate the proposed methodologys' performance, we compare ground truth masks with obtained segmentations for each CT and MRI volume. We used four different metrics to evaluate segmentation accuracy, namely, the Dice similarity coefficient (DSC), Jaccard Index (JI), surface distance (SD), and Hausdorff distance (HD). DSC and JI measure the level of overlap between the ground truth and predicted segmentations, while SD and HD examine boundary distances. The DSC metric measures the degree of overlap between the ground truth and predicted segmentation. It is a commonly used metric for evaluating segmentation quality and can be written as: where G is the ground truth, and P is the predicted mask. Similarly, the Jaccard Index (JI) emphasizes the size of the intersection divided by the size of the union of the sample sets. The mathematical representation of the JI can be written as: where G is the ground truth, and P is the predicted mask. SD measures an average of the minimum voxel-wise distance between the ground truth and predicted object boundaries and can be written as: where n G and n P denote the number of voxels on the object boundaries in the ground truth and predicted segmentations, respectively. Furthermore, HD represents the maximum of the minimum voxel-wise distances between the ground truth and predicted object boundaries and can be written as: where g is the ground truth, and p is the predicted mask.

Experiments and Results
To demonstrate the effectiveness of the proposed approach and our design choice for the new FM-ResNet unit, we train encoder-decoder based architecture using 3D Pre-ResNet without and with VAE regularization as well as the proposed FM-Pre-ResNet without and with VAE regularization. Table 2 summarizes an average whole heart segmentation results on CT and MRI images. On CT images, the 3D Pre-ResNet network achieves average WHS segmentation results for DSC, JI, SD, and HD of 87.11%, 80.16%, 1.71 mm, and 24.44 mm, respectively. The addition of VAE at Pre-ResNet segmentation encoders' endpoint improve DSC, JI, SD and HD values for 2.12%, 1.0%, 0.2039 mm, and 2.704 mm, respectively. The 3D FM-Pre-ResNet network achieves DSC, JI, SD, and HD values of 90.03%, 82.14%, 1.43 mm, and 18.82 mm, respectively. Compared to the 3D Pre-ResNet, it achieves improvement in DSC, JI, SD, and HD values of 2.92%, 1.98%, 0.2789 mm, and 56, 307 mm, which means that the proposed FM-PreResNet unit significantly improves segmentation accuracy. Moreover, the highest DSC, JI, SD, and HD are achieved using 3D FM-Pre-ResNet + VAE network and report values of 90.39%, 82.24%, 1.1093 mm, and 15.3621 mm, respectively.
The 3D FM-Pre-ResNet network achieves average DSC, JI, SD, and HD values of 88.40%, 78.55%, 2.4558 mm, and 32.0451 mm, respectively. Compared to 3D Pre-ResNet, it achieves improvement in DSC, JI, SD, and HD values of 5.34%, 3.01%, 3.4643 mm, and 10.5127 mm, which means that the proposed FM-PreResNet unit significantly improves segmentation accuracy. Moreover, the highest DSC, JI, SD, and HD are achieved using 3D FM-Pre-ResNet + VAE network and report values of 89.50%, 80.44%, 1.8599 mm and 25.6558 mm, respectively. These results highlight the improvement in segmentation accuracy afforded by the introduction of FM-Pre-ResNet units and VAE.
Boxplots showing the distribution of the DSC for WH, LV, Myo, LA, RA, RV, AO, and PA using different segmentation networks on MMWHS CT and MRI testing datasets are presented in Figures 6 and 7, respectively. Additional, structure-wise segmentation accuracies for the LV, RV, LA, RA, Myo, Ao, and PA, for both CT and MRI images, are summarized in Tables 3 and 4.   The p-values have been calculated using a Wilcoxon rank-sum test to show the significant difference of used architectures. Bonferroni correction was used for controlling the family-wise error rate. Figures 8 and 9 show the comparisons and p-values for CT and MRI testing datasets, respectively.
The visual inspection of the obtained segmentations using each network investigated in this work is presented in Figure 10 for the CT dataset, and Figure 11 for the MRI dataset. For example, Figure 11d shows clear improvements regarding LV segmentation that is obtained using FM-Pre-ResNet compared to missed segmentation of LV parts while using Pre-ResNet without a proposed feature merge residual unit as shown in Figure 11b. Moreover, Figure 11e shows a significant reduction in segmentation error compared to all other presented networks. This further highlights the benefits of the proposed FM-Pre-ResNet + VAE approach. Nonetheless, in both modalities, PA and Myo's segmentation results are significantly lower than other substructures due to high shape variations and heterogeneous intensity of blood fluctuations. Figure 12 shows 3D visualization of the best and the worse segmentation cases on the CT and MRI test dataset obtained using the proposed FM-Pre-ResNet +VAE approach.     Furthermore, Pre-ResNet has demonstrated that increasing the depth of the network improves model performance significantly. The addition of two convolutional layers at the top and bottom of the pre-activation residual block introduced in our FM-Pre-ResNet unit allows for the feature fusion block to reach the same depth with fewer parameters which benefits model performance. Therefore, the proposed type of connectivity of the FM-Pre-ResNet unit in terms of depth and number of parameters regarding Pre-ResNet implies no increase in the number of parameters compared to the Pre-ResNet. Time-wise, each training epoch (200 cases) and prediction times on two GPU-s (NVIDIA Titan V) are significantly reduced with architectures with VAE. This shows the computational efficiency of our choice for VAE introduction. Comparison of depth, number of parameters, training times per epoch, and prediction time of one volume for different architectures is shown in Table 5. Comparison with State-of-the-Art Methods The proposed approach was compared with other similar deep learning approaches in terms of image segmentation accuracy as shown in Tables 6 and 7. An approach that combines atlas registration with CNNs' [12] provides an incremental segmentation that allows user interaction, which can be beneficial in a clinical setting. Nevertheless, the challenges of accurate atlas registration resulted in low accuracy on MRI images. The deep supervision mechanism [23] and use of transfer learning [21] result in an increase of trainable parameters and overall network complexity. In contrast, we aim to introduce a light-weight network that results in a significantly deep network without increasing the parameter number. Moreover, the authors in [21] report an average WHS DSC of 0.914 ± 0.075 on CT images and 0.890 ± 0.054 on MRI images using a hold-out set of 10% of training data and evaluate their method with 10-fold cross-validation. Our results report 0.9039 ± 0.0517 on CT images and 0.8950 ± 0.0215 on MRI images and are evaluated on all unseen 40 subjects, which shows that the VAE stage's introduction significantly helps in overcoming overfitting problems. Therefore, these results highlight the advantages of our proposed method. Table 6. Comparison of an average DSC, JI, SD, and HD of the state-of-the-art whole heart segmentation methods on CT images. encoder layers. The segmentation encoder learns a low-dimensional representation of the input, after which VAE reduces the input to a low-dimensional space of 256 (128 to represent std, and 128 to represent mean). A sample is then drawn from the Gaussian distribution with the given std and mean and reconstructed into the input image dimensions following the same architecture as the decoder but without inter-level skip connections. Therefore, VAE acts as a regulator of model weights, adds additional guidance, and exploits the encoder endpoint features. In the end, the segmentation decoder learns high-level features and creates the final segmentations. We evaluate the proposed approach on MMWHS CT and MRI testing datasets and obtain average WHS DSC, JI, SD, and HD values of 90.39%, 82.24%, 1.1093, 15.3621 for CT images, and 89.50%, 80.44%, 1.8599, 25.6558 for MRI images, respectively. Results for both datasets are highly comparable to the state-of-the-art.