Numerical Evaluation on Parametric Choices Influencing Segmentation Results in Radiology Images—A Multi-Dataset Study

Abstract: Medical image segmentation has gained greater attention over the past decade, especially in the field of image-guided surgery. Here, robust, accurate and fast segmentation tools are important for planning and navigation. In this work, we explore Convolutional Neural Network (CNN) based approaches for multi-dataset segmentation from CT examinations. We hypothesize that the selection of certain parameters in the network architecture design critically influences the segmentation results. We have employed two different CNN architectures, 3D-UNet and VGG-16, given that both networks are well accepted in the medical domain for segmentation tasks. In order to understand the efficiency of different parameter choices, we have adopted two different approaches. The first one combines different weight initialization schemes with different activation functions, whereas the second approach combines different weight initialization methods with a set of loss functions and optimizers. For evaluation, the 3D-UNet was trained with the Medical Segmentation Decathlon dataset and VGG-16 using LiTS data. The quality assessment, done using eight quantitative metrics, supports the use of our proposed strategies for improving the segmentation results. Following a systematic approach in the evaluation of the results, we propose a few strategies that can be adopted for obtaining good segmentation results. Both of the architectures used in this work were selected on the basis of their general acceptance in segmentation tasks for medical images and their promising results compared to other state-of-the-art networks. The highest Dice scores obtained with 3D-UNet for the liver, pancreas and cardiac data were 0.897, 0.691 and 0.892, respectively. VGG-16 was developed to work solely with liver data and delivered a Dice score of 0.921.
From all the experiments conducted, we observed that two of the combinations, Xavier weight initialization (also known as Glorot) with the Adam optimizer and cross-entropy loss (Glo AdamCE) and LeCun weight initialization with cross-entropy loss and the Adam optimizer (Lec AdamCE), worked best for most of the metrics in a 3D-UNet setting, while Xavier together with cross-entropy loss and the Tanh activation function (Glo tanhCE) worked best for the VGG-16 network. Here, the parameter combinations are proposed on the basis of their contributions in obtaining optimal outcomes in segmentation evaluations. Moreover, the preliminary evaluation results suggest that these parameters could later be used for gaining more insight into model convergence and optimal solutions. The results from the quality assessment metrics and the statistical analysis validate our conclusions, and we propose that the presented work can be used as a guide for choosing parameters for the best possible segmentation results in future works.


Introduction
Over the past 20 years, Image Guidance Systems (IGS) have gained greater attention due to their numerous benefits: better control over the surgical procedure, reduced morbidity, shortened OR times and overall better patient outcomes [1]. Accurate segmentation (the process of extracting the region of interest) of organ structures in medical images is a key part of IGS [2]. It assists clinicians during diagnosis to localize abnormalities, evaluate tissue volume and plan the treatment pre-operatively and intra-operatively [3]. Computed tomography (CT), magnetic resonance imaging (MRI) and ultrasound (US) images are the most widely used modalities for segmentation. Semiautomatic and fully automatic segmentation methods performed on these modalities using different techniques have been an active area of research for a long time [4]. However, there are still certain challenges to be overcome in medical image segmentation, especially for organs like the liver that have a remarkable intensity similarity with adjacent organs such as the heart, stomach and spleen. Also, intensity inhomogeneity, often caused by imaging artifacts and pathological conditions, can make the process challenging [4].
In recent years, the application of machine learning (ML) and deep learning (DL) has contributed widely to the development of automatic segmentation methods in medical imaging [5,6]. Deep learning-based algorithms have been applied to a wide variety of problems and have proven efficient compared to traditional techniques in many aspects, including accuracy, speed and robustness. Deep learning refers to stacked neural networks, which can be viewed as a composition of many functions. The stacked network consists of several layers that together form the whole architecture. Each layer is made up of different nodes, where the computations happen when they receive inputs. During training, the layers extract features from a low level to a higher level. Variables that define the network structure and how the network is to be trained are called hyper-parameters. Hyper-parameters strongly influence the learned parameters: the values of the weights and biases are a result of the selection of these hyper-parameters. For the model selection process, we start with an initial hypothesis set. Once the decision on the model to be used from this hypothesis set is made, training using the whole training data is initiated. After training, the model is validated on the validation dataset, and later on the test data to measure the accuracy. The selection of hyper-parameters is crucial in determining the performance of the network model. There is always a trade-off between these choices, the quality of the solutions and the computation time required [7]. Often referred to as the trickiest part of designing network models, these parameters can lead to premature or poor convergence of models if not chosen wisely. Usually, this process is a matter of trial and error, but researchers are investigating better combinations of these hyper-parameters [8].
In this research paper, we consider different aspects of the variables that determine the network architecture. We explore different possible combinations and their influence on the efficiency of the network in predicting accurate segmentation results on CT data. Finally, we present combinations of these parameters that have been applied to a pre-trained model and analyze the performance of each of them. The main objective of this research work lies in finding the significance of choosing optimal parameters and their effect on training performance and on the task at hand. Following our findings, and in order to examine the possibility of generalization, we conducted different experiments on different datasets, namely liver, pancreas and cardiac data. The promising results from these experiments suggest that these combinations can be introduced as a generalized approach for achieving improved segmentation results.
In this paper, we methodically studied the impact of different combinations that influence the network performance on the prediction of results.
• We tested different combinations of parameters for organ segmentation on the CT modality, including liver, cardiac and pancreas.
• An analysis of incremental performance while using these combinations was carried out.
• We present consistent results on the pre-trained CNN models using the proposed combinations, which provide better performance on multi-dataset segmentation on CT images.

Related Works
In the past 10 years, there has been a significant research contribution worldwide to the development of CNNs for various tasks, including image segmentation, detection and classification [9,10]. The segmentation of large volumes of medical data can be done using different architectures, mainly based on 2D CNNs and 3D CNNs. The 2D CNN architectures usually work in a slice-by-slice fashion, whereas 3D CNNs are employed for volumetric analysis [11]. End-to-end training of models for pixel-wise semantic segmentation is done using FCNs [12], whereas 3D U-Net is more widely accepted by researchers [13,14]. Another architecture that has proved its efficiency in the multiple tasks of classification, detection and segmentation is VGG-16 [15].
Regardless of the network used, designing a deep learning-based model is a multiphase process. Every step, from the collection of data to obtaining results, requires attention and careful decisions. Once the data has been gathered, data preparation processes such as pre-processing and augmentation make it suitable for training. In the next step, we design the network architecture, either by building one or by choosing a suitable base architecture, followed by training the network using the collected data and evaluating the task performance. Finally, the results obtained are analyzed and strategies to improve network performance are adopted. This process includes training data analysis, tweaking of hyper-parameters, use of different parameter choices or even changing the entire architecture [16]. In the literature, a few works have focused on studying different characteristics of ML algorithms, investigating the features of backpropagation and weight updates [17]. To better understand the working of the designed network architecture, we need to have knowledge of different underlying concepts. These include the number of layers to be introduced, the units per layer, the type of layers, the cost function and the optimizing algorithms. Reference [18] studied the impact of weight initialization together with momentum in obtaining the desired results. Proper weight initialization is an important factor with a strong impact on the training time as well as on the quality of the resulting network model [19]. In fact, an improper weight initialization scheme can result in poor convergence of the model [20]. Reference [21] demonstrated the impact of choosing the right activation function on training dynamics and model performance. In [22], the authors proposed a strategy for the selection of hyperparameters that includes learnable parameters such as the weights and biases of each layer, as well as the number of filters, strides, kernel sizes and the number of units per layer.
In [23], the authors studied different loss functions used in deep neural networks with the objective of understanding the impact of particular choices on learning dynamics for classification tasks. In [24], the authors worked on improving the accuracy of a CNN model by experimenting with different combinations of weight initialization and activation functions. Breuel [25] conducted large-scale experiments to observe the effect of hyperparameters including learning rate, batch size and depth of the network, based on simple SGD training. In [26], the author presents an extensive study of the effect of batch normalization in deep neural networks. The paper concludes that batch normalization is a beneficial addition to neural network problems. Reference [27] proposes a new method of hyperparameter optimization that combines Bayesian optimization and Hyperband. In [28], the authors implemented a CNN for Natural Language Processing, where they varied different parameters to study their effect on CNN performance. The authors conclude that, for less complex CNNs, a small number of parameter adjustments can achieve significant improvement. Recently, in [29], the authors studied the influence of activation functions on a CNN model. This CNN model, designed for facial recognition, was tested with five different activation functions (Sigmoid, Tanh, ReLU, leaky ReLU and softplus-ReLU) and also with a new activation function proposed in the paper. In [30], the authors experimented with a large number of hyperparameter configurations to investigate how they affect the performance of deep neural networks (DNNs) and identified that the activation function, dropout regularization, and the number of hidden layers and neurons play a critical role. A comparative analysis of hyperparameter effects was carried out in [31], which proposes that the right choice of parameters directly affects learning and predictions.
Inspired by these works, in this paper we also focus on experimenting with different combinations of these parameters and analyze the impact of these choices on accuracy by evaluating them with a wide range of quality metrics. This paper is organized as follows. Section 1 gives an introduction to the paper. Section 2 describes the background and the related work. In Section 3, the various methods, datasets and metrics used to evaluate the results are presented. In Section 4, the experiments comparing different hyper-parameters and their corresponding results are discussed. In Section 5, the conclusion and future work are presented.

Materials and Methods
In this work, we have adopted two different approaches to perform automatic segmentation by using the network to learn from combinations of base activation functions and weight initialization techniques. We present our work through two well-known architectures, 3D U-Net and VGG-16, on two standard datasets (Medical Segmentation Decathlon and LiTS), showing that there are possibilities for substantial improvements in the overall network performance. For the experiment with 3D U-Net, we focused on comparing the performance of several weight initializers with different optimizers and loss functions, keeping ReLU as the activation function. For the VGG-16 network, we experimented with different weight initializers combined with different activation functions.

Convolutional Neural Networks
Convolutional neural networks differ in architecture from regular fully connected networks. The layers of a CNN model are organized in three dimensions: width, height and depth. Nodes in one layer are not necessarily connected to all nodes in the next layer and can be connected to only a selected portion of them. The output is reduced to probability scores represented as a single vector, which is organized along the depth dimension. The convolutional layers are responsible for identifying the low-level and high-level features in all locations of the input data. Convolution refers to a mathematical operation that combines two functions to produce a third function. This can be interpreted as integrating different information to deliver new information. The convolution process is performed using convolution kernels, which produce feature maps after convolving with the input data representations. The two main network models that we use in this experiment are VGG-16 and 3D U-Net. The main reason for selecting these models was their wide acceptance for biomedical image segmentation [10,32-34].
In this work, the experiments were done using two sets of parameters, one that worked solely for liver segmentation and another, generalized version that worked for multi-dataset segmentation. For both, we experimented with the same values and selected the best choices to present in the paper. From our results and observations, we infer that the presented combinations can be used for getting better segmentation results. Figure 1 represents the workflow of the whole evaluation process we followed. In general, we split the dataset into training, validation and test data. The selected parameter combinations are applied to the chosen network architecture; the trained model is then tested using the test data and the predictions are analysed using the quality metrics. The quality metrics chosen for this study to evaluate the segmentation predictions were based on the recommendations given in [35]. We used VGG-16 as the base architecture for the initial liver segmentation model. VGG-16 is now widely used for different tasks including segmentation, detection and classification.
In neural networks, the role of hyperparameters is very important in predicting the outcome. Each node in a network has one or more scalar inputs and a single output. The edges that link these nodes from layer to layer have a scalar weight and a bias factor. The output $y$ of each node can be represented using Equation (1):

$$y = f\left(\sum_{i} w_i x_i + b\right) \quad (1)$$

Here, the $x_i$ are the inputs coming to the node, the $w_i$ are the weights associated with the edges that connect to that particular node and $b$ is the bias factor. The function $f$ is the activation function that determines the output of that particular node.
We will be exploring several of these activation functions in order to improve performance.
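As a small illustration of the node output described above, the following sketch evaluates a single node with NumPy. The function name `node_output` and the example values are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
import numpy as np


def node_output(x, w, b, f=np.tanh):
    """Output of one node: activation f applied to the weighted input sum plus bias."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    return f(np.dot(w, x) + b)


# Hypothetical inputs, weights and bias for a node with three incoming edges.
x = [0.5, -1.0, 2.0]
w = [0.1, 0.2, 0.3]
b = 0.05
y = node_output(x, w, b)  # tanh of the pre-activation value 0.5
```

Passing a different callable as `f` (for example a sigmoid or ReLU) changes only the non-linearity, not the weighted sum.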

Weight Initialization
Training a neural network is a non-convex problem. This means that there can be multiple local minima, some of which are better than others. Weight initialization is therefore an important factor in reaching a good local minimum.

LeCun Initialization
Reference [36] initializes the weights with a scaled Gaussian distribution, where each element of the weight array is drawn independently from a Gaussian distribution with mean 0 and standard deviation $\sqrt{1/n_{in}}$, where $n_{in}$ is the number of input units in the weight tensor.

Xavier Initialization
Reference [37] studied the influence of non-linear activation functions. The non-linear logistic sigmoid activation function is not suited to random initialization of deep neural networks because its non-zero mean value can drive especially the top layers of the network into saturation. The authors proposed a new initialization method that saturates less often and brings substantially faster convergence [37]. The initialization method is known as Glorot/Xavier initialization. This initializer keeps the scale of the gradients roughly the same in all layers. Its derivation is based on the assumption that the activations are linear. The method initializes the weights by drawing samples from a truncated normal distribution centered on zero with a standard deviation of $\sqrt{2/(n_{in} + n_{out})}$, where $n_{in}$ and $n_{out}$ are the numbers of input and output units in the weight tensor.

He Initialization
Reference [19] proposed a robust initialization method built on Xavier initialization that particularly considers rectified non-linearities. Unlike Xavier initialization, the method can make extremely deep neural networks converge. In the He initialization method, the weights are initialized depending on the size of the previous layer. The weights are still random, but their range differs based on the number of neurons in the previous layer. The method draws samples from a truncated normal distribution centered on zero with a standard deviation of $\sqrt{2/n_{in}}$, where $n_{in}$ is the number of input units in the weight tensor.
He initialization generally works better on ReLU and PReLU activation functions.

Random Normal Initialization
One of the most commonly used initialization methods is random normal initialization, where all the values of the weight matrix are initialized as random numbers. Although this type of initialization is susceptible to vanishing or exploding gradients, we used it in this experiment because, as mentioned in [38], random weights perform well at times.
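The four initialization schemes described in this section can be sketched with NumPy as follows. This is a minimal illustration, not the code used in the experiments: it draws from a plain (untruncated) normal distribution for simplicity, and the function names and the default standard deviation of 0.05 for the random normal initializer are assumptions made for this example.

```python
import numpy as np


def lecun_normal(n_in, n_out, rng):
    """LeCun: zero mean, standard deviation sqrt(1 / n_in)."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))


def glorot_normal(n_in, n_out, rng):
    """Glorot/Xavier: zero mean, standard deviation sqrt(2 / (n_in + n_out))."""
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))


def he_normal(n_in, n_out, rng):
    """He: zero mean, standard deviation sqrt(2 / n_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))


def random_normal(n_in, n_out, rng, std=0.05):
    """Random normal: fixed standard deviation (0.05 is an assumed default)."""
    return rng.normal(0.0, std, size=(n_in, n_out))


# Example: weights for a dense layer with 400 inputs and 300 outputs.
rng = np.random.default_rng(0)
weights = {
    "Lec": lecun_normal(400, 300, rng),
    "Glo": glorot_normal(400, 300, rng),
    "He": he_normal(400, 300, rng),
    "RandNorm": random_normal(400, 300, rng),
}
```

The only difference between the first three schemes is how the spread of the weights scales with the layer size, which is why He initialization yields the largest weights for the same layer.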

Optimizers
Optimizers play an important role in minimizing the loss during the training phase. Guided by the loss function, the optimizer updates the model parameters towards the best possible solution. We used the following optimizers in the experiments presented in this work.

RMSprop
RMSprop is an adaptation of the Rprop algorithm [39] to mini-batch learning. RMSprop is also similar to Adagrad [40], but it deals with the radically diminishing learning rates occurring in Adagrad. RMSprop divides the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. It keeps a moving average of the squared gradients for each weight and divides the gradient by the square root of this mean square:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$$

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t}} g_t$$

where $E[g^2]_t$ is the moving average of squared gradients, $g_t$ is the gradient of the cost function with respect to the weight $w_t$, $\gamma$ is the decay rate of the moving average and $\eta$ is the learning rate.

Adam
The Adaptive Moment Estimation (Adam) [41] method is another adaptive learning rate method. Like Adadelta and RMSprop, Adam stores an exponentially decaying average of past squared gradients $v_t$. In addition, Adam keeps an exponentially decaying average of past gradients $m_t$, similar to momentum. $m_t$ and $v_t$ are estimates of the first moment and the second moment of the gradients, respectively. The Adam update rule for a weight $w_t$ is given by

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates of the first moment and the second moment of the gradients, respectively, $\eta$ is the learning rate and $\epsilon$ is a small constant for numerical stability.
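The two update rules described in this section can be sketched as single-step functions. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: the decay values (0.9 for RMSprop, 0.9 and 0.999 for Adam) and the stabilizing constant `eps` are commonly used defaults assumed here, and the function names are hypothetical.

```python
import numpy as np


def rmsprop_step(w, g, avg_sq, lr=1e-4, decay=0.9, eps=1e-8):
    """One RMSprop update: scale the gradient by the root of its running mean square."""
    avg_sq = decay * avg_sq + (1.0 - decay) * g ** 2
    w = w - lr * g / (np.sqrt(avg_sq) + eps)  # eps (assumed) avoids division by zero
    return w, avg_sq


def adam_step(w, g, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moment estimates."""
    m = beta1 * m + (1.0 - beta1) * g          # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * g ** 2     # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


# Example: minimize the toy cost C(w) = w^2, whose gradient is 2w.
w_adam, m, v = 1.0, 0.0, 0.0
w_rms, avg_sq = 1.0, 0.0
for t in range(1, 501):
    w_adam, m, v = adam_step(w_adam, 2.0 * w_adam, m, v, t, lr=0.01)
    w_rms, avg_sq = rmsprop_step(w_rms, 2.0 * w_rms, avg_sq, lr=0.01)
```

Both optimizers normalize the step by a running estimate of the gradient magnitude; Adam additionally smooths the gradient itself, which is the momentum-like behaviour mentioned above.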

Loss Functions
Two loss functions were used: softmax cross-entropy loss and Dice loss.

Softmax-Cross Entropy Loss
The softmax cross-entropy loss is a combination of the softmax activation function and the cross-entropy (CE) loss. The CE loss is defined as

$$CE = -\sum_{i}^{C} t_i \log(s_i)$$

where $t_i$ is the ground truth and $s_i$ is the score for each class $i$ in $C$. In the softmax cross-entropy loss, the softmax activation function is applied to the scores before the CE loss computation, so

$$CE = -\sum_{i}^{C} t_i \log(f(s)_i)$$

where $f(s)_i$ is the softmax activation of the score, which is given by

$$f(s)_i = \frac{e^{s_i}}{\sum_{j}^{C} e^{s_j}}$$

Dice Loss
Dice loss is based on the Sørensen-Dice coefficient (DSC), a statistic that estimates the similarity between two samples. The DSC ranges between 0 and 1, with 1 indicating perfect overlap. The loss 1 − DSC is therefore minimized in order to maximize the overlap between the two sets.
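The two loss functions can be sketched in NumPy as shown below. This is a simplified illustration and not the training code of the experiments: it operates on plain arrays without automatic differentiation, and the smoothing constants `eps` as well as the function names are assumptions introduced for this example.

```python
import numpy as np


def softmax(scores):
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def softmax_cross_entropy(scores, targets, eps=1e-12):
    """Mean over samples of -sum_i t_i * log(softmax(s)_i); targets are one-hot."""
    probs = softmax(np.asarray(scores, dtype=float))
    targets = np.asarray(targets, dtype=float)
    return float(-np.mean(np.sum(targets * np.log(probs + eps), axis=-1)))


def dice_loss(pred, target, eps=1e-7):
    """1 - DSC for foreground probabilities `pred` and a binary mask `target`."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    intersection = np.sum(pred * target)
    dsc = (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)
    return float(1.0 - dsc)


# Example: two voxels, two classes (background, foreground).
scores = np.array([[0.0, 0.0], [2.0, -1.0]])
targets = np.array([[0.0, 1.0], [1.0, 0.0]])
ce = softmax_cross_entropy(scores, targets)
dl = dice_loss(softmax(scores)[:, 1], targets[:, 1])
```

The cross-entropy term penalizes each voxel independently, whereas the Dice loss is computed over the whole region, which is one reason the two can behave differently on small structures.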

Activation Functions
The activation function is a parameter that contributes largely to the behaviour of a neural network model. The non-linearity of the network is introduced by this mathematical function, which also decides whether a specific neuron should fire or not. Activation functions permit network models to compute arbitrarily complex functions. In this experiment, we decided to work with the popular activation functions Tanh, Sigmoid and ReLU.

Tanh Activation Function
Tanh is a widely used non-linear activation function that squashes real values to the range [−1, 1]. Figure 2 plots the Tanh activation function and its derivative.

Sigmoid Activation Function
The sigmoid is a monotonic activation function that takes real values as input and outputs a value in the range [0, 1] (see Figure 3). It gives a smooth gradient and is considered well suited to classification.

ReLU Activation Function
The Rectified Linear Unit, abbreviated as ReLU, is often chosen for its ability to handle the vanishing gradient problem; its output range is [0, ∞). Figure 4 shows the ReLU activation function and its derivative.
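For reference, the three activation functions and their derivatives can be written compactly in NumPy. This is a minimal sketch for illustration; the function names are chosen here, and the ReLU derivative at exactly zero is set to 0 by convention.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(np.asarray(x, dtype=float)) ** 2


def relu(x):
    """ReLU: max(0, x)."""
    return np.maximum(0.0, np.asarray(x, dtype=float))


def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, otherwise 0 (convention at x = 0)."""
    return (np.asarray(x, dtype=float) > 0).astype(float)


# Evaluate each function on a few sample inputs.
x = np.array([-2.0, 0.0, 2.0])
values = {"sigm": sigmoid(x), "tanh": np.tanh(x), "relu": relu(x)}
grads = {"sigm": sigmoid_grad(x), "tanh": tanh_grad(x), "relu": relu_grad(x)}
```

The derivatives of the sigmoid and Tanh shrink towards zero for large positive or negative inputs, while the ReLU derivative stays at 1 for all positive inputs.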

Dataset
For the experiments, we included different datasets in order to test how well the proposed combinations generalize. We therefore performed multi-organ segmentation as a part of testing the proposed approach. Apart from the liver data, we also used pancreas and cardiac data.

Liver-LiTS
The Liver Tumor Segmentation Challenge (LiTS) [43] dataset contains a total of 201 contrast-enhanced 3D abdominal CT scans with ground truth segmentations for liver and lesions. The resolution of the images is 512 × 512 in each axial slice. There are 131 scans with ground-truth labels for training and 70 that can be used for testing. The slice spacing ranges from 0.45 mm to 5.0 mm and the in-plane resolution from 0.60 mm to 0.98 mm.

Medical Segmentation Decathlon
The Medical Segmentation Decathlon (MSD) challenge datasets [44] consist of 10 different semantic segmentation tasks. Our experiments are based only on the liver, pancreas and cardiac datasets. The liver dataset has 131 labeled volumes with two labels for segmenting liver and tumor from the CT modality. The pancreas dataset has 282 labeled volumes for segmenting pancreas and tumor from the CT modality. The cardiac dataset has 20 labeled volumes for segmenting the left atrium from the MRI modality. The resolution of the liver and pancreas datasets is 512 × 512 and the resolution of the cardiac dataset is 320 × 320 pixels in each axial slice. All the datasets are scaled to an isotropic resolution of 1 × 1 × 1 mm and normalized to have zero mean and unit variance. The ground truth labels in the liver and pancreas datasets are binarized to only have liver and pancreas labels, respectively.

Experiment 1
For this experiment, we used a TensorFlow implementation of VGG-16 for the purpose of segmenting the liver parenchyma in axial CT images. We varied the activation functions in the hidden layers by applying the sigmoid $\frac{1}{1+e^{-x}}$, hyperbolic $\tanh(x)$ and ReLU functions. For the final layer we applied the Softmax function $s(x)_k = \frac{e^{x_k}}{\sum_{j=1}^{n} e^{x_j}}$. The initial learning rate (lr) used here is $1 \times 10^{-4}$ and the weight decay is 0.0002. All the experiments were done using an NVIDIA Tesla V100 (Nvidia, Santa Clara, CA, USA) using public datasets SLIVER07 for the purpose of training and testing.

Experiment 2
In this experiment, the Chainer implementation of the 3D U-Net is used to segment the liver parenchyma and pancreas parenchyma from CT volumes and the left atrium from MRI volumes. 3D patches of 64 × 64 × 64 were used as the input to the network. Different combinations of weight initialization methods with different loss functions and optimizers were tested. We used the Glorot, He and LeCun initialization methods in combination with loss functions including softmax cross-entropy and Dice loss, and optimizers including Adam and RMSProp. In total, 12 different combinations were tested for each dataset. We used an initial learning rate of 0.0001 and ReLU activation for all the combinations of this experiment. The Medical Segmentation Decathlon (MSD) challenge datasets were used for the purpose of training, validation and testing in this experiment. The models for all the combinations were trained on an NVIDIA DGX2 server with Tesla Volta GPUs.
As both the experiments were done simultaneously, we employed two different servers for handling the computations. Also, in our experiments, we used 70% data for training, 20% for validation and 10% for testing.
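The experimental grid and the data split described above can be sketched as follows. This is an illustrative sketch only: the short labels (`CE` for softmax cross-entropy, `DC` for Dice loss) mirror the configuration names used in this paper, while the helper `split_cases`, the random seed and the shuffling strategy are assumptions, since the exact splitting procedure is not specified.

```python
import random
from itertools import product

INITS = ["Glo", "He", "Lec"]          # Glorot, He and LeCun initialization
OPTIMIZERS = ["Adam", "RMSProp"]
LOSSES = ["CE", "DC"]                 # softmax cross-entropy and Dice loss

# Every initializer/optimizer/loss combination: 3 x 2 x 2 = 12 configurations.
CONFIGS = list(product(INITS, OPTIMIZERS, LOSSES))


def split_cases(case_ids, train=0.7, val=0.2, seed=0):
    """Shuffle case identifiers and split them into train/validation/test lists."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)   # fixed seed (assumed) for reproducibility
    n_train = int(round(train * len(ids)))
    n_val = int(round(val * len(ids)))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Example with the 20 labeled volumes of the cardiac dataset.
train_ids, val_ids, test_ids = split_cases(range(20))
```

Splitting by case identifier rather than by slice or patch keeps all data from one volume in a single subset.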

Segmentation Evaluation Methods
In order to compare different configurations, we have selected eight evaluation metrics. These include a spatial overlap-based metric, the Dice coefficient (DICE); spatial distance-based metrics, namely Hausdorff Distance (HD), Average Hausdorff Distance (AVD) and Mahalanobis Distance (MHD); information-theoretic measures, namely Mutual Information (MI) and Variation of Information (VOI); a probabilistic measure, the Area under the ROC Curve (AUC); and finally a volume-based metric, Volumetric Similarity (VS). The selection of these metrics is based on the target of the segmentation methods applied in this study and on the recommendations given in [35].

Dice Coefficient (DICE)
The Dice coefficient (DICE) is the most commonly used metric for validation of medical image segmentation [35]. It is used to find the overlap between the ground-truth segmentation $S_g$ and the test segmentation $S_t$ using

$$DICE = \frac{2 |S_g \cap S_t|}{|S_g| + |S_t|}$$

where $|S_g|$ and $|S_t|$ are the cardinalities of the two sets.
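A minimal NumPy sketch of this overlap measure on binary masks is given below. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for this example; the study itself computed the metric with the VISCERAL evaluation software.

```python
import numpy as np


def dice_coefficient(seg_gt, seg_test):
    """DICE = 2 |Sg ∩ St| / (|Sg| + |St|) for two binary masks of equal shape."""
    g = np.asarray(seg_gt, dtype=bool)
    t = np.asarray(seg_test, dtype=bool)
    denom = int(g.sum()) + int(t.sum())
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * int(np.logical_and(g, t).sum()) / denom


# Example on a tiny 2 x 3 slice: 2 shared voxels, 3 voxels in each mask.
gt = np.array([[1, 1, 0], [0, 1, 0]])
pred = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_coefficient(gt, pred)
```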

Hausdorff Distance (HD)
The Hausdorff Distance (HD) [45] is a spatial distance-based metric used to evaluate the dissimilarity between two segmentation contours. Like other distance-based measures, the spatial distance is measured using the spatial positions of the voxels. For two finite point sets $A$ and $B$, HD is defined in terms of the directed Hausdorff distance $h(A, B)$ as

$$HD(A, B) = \max(h(A, B), h(B, A))$$

where

$$h(A, B) = \max_{a \in A} \min_{b \in B} \|a - b\|$$

and $\|\cdot\|$ is some norm such as the Euclidean distance. A smaller value of HD implies better segmentation results.

Average Hausdorff Distance (AVD)
One of the drawbacks of HD is that it is sensitive to outliers. The Average Hausdorff Distance (AVD) [45], as the name suggests, is the HD averaged over all points. AVD is generally more stable and is defined as

$$AVD(A, B) = \max(d(A, B), d(B, A))$$

where $d(A, B)$ is the directed average Hausdorff distance defined as

$$d(A, B) = \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} \|a - b\|$$

Mahalanobis Distance (MHD)
Mahalanobis Distance (MHD) [46] uses the means $\mu_A$ and $\mu_B$ of the two point clouds being compared (segmented images) and their common covariance matrix $S$ to give the following distance measure

$$MHD(A, B) = \sqrt{(\mu_A - \mu_B)^T S^{-1} (\mu_A - \mu_B)}$$

where the common covariance matrix $S$ is given as

$$S = \frac{\alpha_1 S_1 + \alpha_2 S_2}{\alpha_1 + \alpha_2}$$

In the above equation, $S_1$ and $S_2$ are the covariance matrices of the two sets of voxels, with $\alpha_1$ and $\alpha_2$ voxels, respectively.
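The distance-based metrics HD, AVD and MHD can be sketched for small point sets as follows. This is a brute-force illustration under stated assumptions: points are given as arrays of coordinates, all pairwise Euclidean distances are held in memory (which does not scale to full volumes), the covariance matrices are normalized by the number of points, and a pseudo-inverse is used in case the common covariance matrix is singular. The function names are hypothetical.

```python
import numpy as np


def _pairwise_distances(a, b):
    """Euclidean distances between every point in a (N x D) and b (M x D)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def hausdorff_distance(a, b):
    """HD(A, B) = max(h(A, B), h(B, A)) with h the directed Hausdorff distance."""
    d = _pairwise_distances(a, b)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))


def average_hausdorff_distance(a, b):
    """AVD(A, B) = max(d(A, B), d(B, A)) with d the directed average distance."""
    d = _pairwise_distances(a, b)
    return float(max(d.min(axis=1).mean(), d.min(axis=0).mean()))


def mahalanobis_distance(a, b):
    """MHD between the means of two point clouds using a pooled covariance."""
    diff = a.mean(axis=0) - b.mean(axis=0)
    s1 = np.cov(a, rowvar=False, bias=True)   # normalized by N (assumption)
    s2 = np.cov(b, rowvar=False, bias=True)
    s = (len(a) * s1 + len(b) * s2) / (len(a) + len(b))
    return float(np.sqrt(diff @ np.linalg.pinv(s) @ diff))


# Example: a unit-variance square of points and the same square shifted by (3, 4).
A = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
B = A + np.array([3.0, 4.0])
hd = hausdorff_distance(A, B)
avd = average_hausdorff_distance(A, B)
mhd = mahalanobis_distance(A, B)
```

In practice the point sets would be the voxel coordinates of the two segmentations, for example obtained with `np.argwhere(mask)`.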

Mutual Information (MI)
In information theory, Mutual Information (MI) between two random variables provides a measure of the amount of information that can be obtained about one variable by looking at the other. It can also be used to find the similarity between two segmentations [47]. It is linked to the marginal entropies $H(S_g)$ and $H(S_t)$ and the joint entropy $H(S_g, S_t)$ of the two variables, i.e., the segmented images $S_g$ and $S_t$, and is defined as

$$MI(S_g, S_t) = H(S_g) + H(S_t) - H(S_g, S_t)$$

Variation of Information (VOI)
This is another information theory-based measure. The Variation of Information (VOI) [48] is based on the marginal entropies and MI and provides a measure of the gain or loss in information when changing from one variable to another. It is defined by the following equation

$$VOI(S_g, S_t) = H(S_g) + H(S_t) - 2\,MI(S_g, S_t)$$

Area under ROC Curve (AUC)
The Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR). In the case of segmentation, TPR refers to the ratio of positive (foreground/segmented) voxels identified correctly out of the total number of positive voxels in the ground truth. Similarly, FPR refers to the ratio of voxels identified incorrectly as positives out of the total number of negative (background) voxels in the ground truth. The area under the ROC curve (AUC) is a measure of separability for a classifier, telling how good it is at distinguishing between classes (positive and negative voxels). Based on the definition by [49], AUC is defined as

$$AUC = 1 - \frac{1}{2}\left(\frac{FP}{FP + TN} + \frac{FN}{FN + TP}\right)$$

where FP, TN, FN and TP refer to False Positive, True Negative, False Negative and True Positive, respectively.
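For binary segmentations, MI, VOI and AUC can all be derived from the four confusion-matrix counts, as in the sketch below. Several details are assumptions made for this illustration: entropies use base-2 logarithms, the probabilities are estimated from voxel counts, and the AUC function expects both foreground and background voxels in the ground truth (it does not guard against division by zero). The function names are hypothetical.

```python
import numpy as np


def confusion_counts(seg_gt, seg_test):
    """Return (TP, FP, FN, TN) for two binary masks of equal shape."""
    g = np.asarray(seg_gt, dtype=bool).ravel()
    t = np.asarray(seg_test, dtype=bool).ravel()
    tp = int(np.sum(g & t))
    fp = int(np.sum(~g & t))
    fn = int(np.sum(g & ~t))
    tn = int(np.sum(~g & ~t))
    return tp, fp, fn, tn


def _entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def mutual_information_and_voi(seg_gt, seg_test):
    """MI = H(Sg) + H(St) - H(Sg, St) and VOI = H(Sg) + H(St) - 2 MI."""
    tp, fp, fn, tn = confusion_counts(seg_gt, seg_test)
    n = tp + fp + fn + tn
    h_g = _entropy([(tp + fn) / n, (fp + tn) / n])
    h_t = _entropy([(tp + fp) / n, (fn + tn) / n])
    h_joint = _entropy([tp / n, fp / n, fn / n, tn / n])
    mi = h_g + h_t - h_joint
    return mi, h_g + h_t - 2.0 * mi


def auc(seg_gt, seg_test):
    """AUC = 1 - (FPR + FNR) / 2 for a single binary segmentation."""
    tp, fp, fn, tn = confusion_counts(seg_gt, seg_test)
    return 1.0 - 0.5 * (fp / (fp + tn) + fn / (fn + tp))


# Example: a prediction that is statistically independent of the ground truth.
gt = np.array([1, 1, 0, 0])
pred = np.array([1, 0, 1, 0])
mi, voi = mutual_information_and_voi(gt, pred)
area = auc(gt, pred)
```

A perfect prediction gives the maximum MI for the given ground truth, a VOI of zero and an AUC of one, whereas the independent prediction in the example gives no mutual information.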

Volumetric Similarity (VS)
Volumetric Similarity (VS) is used to compare the volumes of the segmented regions in the two images. The volumes used for comparison are the absolute ones and not only those of the overlapping regions. It is evaluated by subtracting from 1 the volumetric distance, which is defined as the absolute difference between the two volumes divided by the sum of the two [50]:

$$VS = 1 - \frac{\left| |S_t| - |S_g| \right|}{|S_t| + |S_g|}$$
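A short NumPy sketch of the volumetric similarity on binary masks follows. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
import numpy as np


def volumetric_similarity(seg_gt, seg_test):
    """VS = 1 - | |St| - |Sg| | / (|St| + |Sg|), using voxel counts as volumes."""
    vol_g = int(np.count_nonzero(seg_gt))
    vol_t = int(np.count_nonzero(seg_test))
    if vol_g + vol_t == 0:
        return 1.0  # assumed convention: two empty masks have equal volume
    return 1.0 - abs(vol_t - vol_g) / (vol_t + vol_g)


# Example: 3 ground-truth voxels against 1 predicted voxel.
gt = np.array([1, 1, 1, 0])
pred = np.array([0, 0, 0, 1])
vs = volumetric_similarity(gt, pred)
```

Note that the example yields a non-zero similarity although the two masks do not overlap at all, which illustrates that VS compares volumes only and should be read together with overlap-based metrics.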

Experimental Results and Discussion
We performed segmentation of data from different organs using two different networks, 3D-UNet and VGG-16. The former was used for the segmentation of cardiac, liver and pancreas data, whereas the latter was used for the segmentation of the liver alone. The 3D-UNet model used for the segmentation of different organs was tested with a combinatorial approach of weight initialization methods together with loss functions and optimizers. For the liver segmentation model, we used the approach of combining weight initialization with activation functions. From both experiments we gathered information using different evaluation metrics. This was done because it is hard to single out one metric value that indicates that the segmentation obtained from a particular combination works better. We therefore decided to use multiple quality assessment metrics. Figures 5 and 6 show the predictions from the best combinations, which give a high Dice score, and from those that give lower Dice scores, displayed using ITK-SNAP Viewer [51]. These visualizations convey the significant difference between the choices made. The qualitative results from these experiments can be viewed in ITK-SNAP in four different views: axial, coronal, sagittal and a 3D view of the predictions (see Figure 7).

Comparison and Discussion of Results
In order to assess and compare the segmentations resulting from different combinations of hyper-parameters, we have used the eight evaluation metrics described in Section 3.9. The metrics have been evaluated using the VISCERAL evaluation software package [35]. Tables 1-3 show the mean values of the metrics for each configuration applied to the liver, heart and pancreas databases, respectively. These configurations vary in the initialization, loss function and optimizer used. Each configuration in these tables is represented using a notation of the form Init Optimizer Loss, for example Glo Adam CE. Table 4, on the other hand, gives a comparison combining weight initialization choices with activation functions. The configurations in that table are represented using the notation Init activation, where Init = {Glo, RandNorm, He} and activation = {tanh, relu, sigm}. All the parameters used in this study are listed in Table A1.

Figure 6. Axial and 3D views of liver prediction overlaid on ground truth segmentation (1a) using the best combination (Glo tanh), (1b) using the worst combination (He sigm), viewed in ITK-SNAP Viewer [51].

Figure 7. Segmentation results of the liver in axial, coronal, sagittal and 3D views using Glo tanh, which gave the best Dice score, viewed in ITK-SNAP Viewer [51].

It is important to note that each of Tables 1-3 includes only those configurations for which a segmentation result was obtained. Hence, configuration Glo Adam DC for the liver database and all Dice-loss-based configurations for the pancreas database have not been considered. The values in bold in these tables highlight the best two values for each metric, whereas those in italics with an underline mark the worst value. Overall, we see small (but mostly statistically significant) differences in the values of most of the metrics. However, from Table 1 [...] give the worst results for at least four of the total eight metrics.
These conclusions can further be verified using the box plots for all metrics, as illustrated in Figure 8. From the box plot of DICE, it can be observed that the results for both Glo Adam CE and Lec Adam CE are more consistent, having a small interquartile range, and are also uniformly spread around the median value. The outliers for these two configurations are also fewer and lie closer to the minimum value. The same trend of a smaller interquartile range can also be seen for other metrics such as AVD, MHD and AUC. In contrast, the configurations with the worst mean values tend to have a larger spread for most metrics. The results further indicate that the cross-entropy loss function performs better than the DICE loss for liver segmentation. This can be verified from the table and the box plots by comparing each pair of configurations that have the same initialization and optimizer but differ in the loss function.
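The quantities read off a box plot in the discussion above (median, interquartile range and outliers) can be computed directly from the per-case metric values. The sketch below is an illustration under two assumptions that are not stated in this work: quartiles are linearly interpolated, and outliers are the points beyond 1.5 times the interquartile range from the box (Tukey's rule). The function name and the sample values are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Median, interquartile range and Tukey outliers of per-case metric values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr    # points below this are drawn as outliers
    high_fence = q3 + 1.5 * iqr   # points above this are drawn as outliers
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"median": median, "iqr": iqr, "outliers": outliers}


# Hypothetical per-case Dice scores of one configuration (not taken from the tables).
summary = box_plot_summary([0.62, 0.85, 0.88, 0.90, 0.91])
```

A configuration that is consistent in the sense used above has a small `iqr` and few entries in `outliers`.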
For the pancreas as well, the use of DICE as a loss function failed to give any results for any configuration. Among the remaining configurations, we observe that Lec Adam CE again performs among the best, as can be seen in Table 3. In this case, however, the second-best configuration uses He initialization with RMSProp optimization, i.e., He Rms CE. The box plots in Figure 9 also illustrate that, for the majority of metrics, these two configurations have a smaller spread than the others.
For the heart database, we can see from Table 2 that the configuration He Adam CE gives the best results. Two other configurations, Glo Rms CE and He Rms DC, also fare better than the remaining configurations in terms of at least three metrics, excluding the non-discriminant metrics MI and VOI. From Figure 10, it is visible that He Adam CE gives good consistency and better values for DICE, HD, AVD, AUC and VS. Glo Rms CE performs well on HD, AVD, MHD and VS, whereas He Rms DC gives better values and consistency for DICE, AVD, AUC and VS. Hence, for the heart database we see a discrepancy, in that a configuration with DICE as the loss function is also among the better configurations. However, it is important to note that the testing dataset for the heart database was very small compared to the other two, and it would be interesting to see the results with a bigger dataset.

Table 4 shows a comparison of the configurations on the LiTS dataset, which differ in initialization and activation functions. Among these configurations, we can clearly observe that Glo tanh outperforms the RandNorm relu and He sigm configurations.

In order to verify the statistical significance of the differences between the multiple configurations, we have further performed a paired t-test of the best configuration against each of the other configurations for each dataset. This test is performed for each of the eight quality metrics used. The null hypothesis for the paired t-test is rejected when p ≤ 0.05.
However, due to the limited data in the case of the heart database, we have not performed a statistical analysis on its results. Table 5 shows the comparisons with the Glo Adam CE configuration for the liver dataset. The checkmarks in the table denote that the p-value is less than or equal to 0.05, implying that the differences are statistically significant. From the table, we can see that, except for VS and HD, all the metrics show that selecting different loss functions, initializations and optimizers leads to significant improvements in the output. Moreover, as expected, for Lec Adam CE, which was the second-best configuration and hence closer to Glo Adam CE in terms of results, none of the metric values is significantly different except MI. Additionally, we also see from Table 5 that, for all initializations combined with CE loss and the Rms optimizer, both DICE and AUC do not change significantly, whereas AVD, MHD, MI and VOI reveal significant changes in outcomes between these configurations. Similarly, Table 6 shows the statistical analysis results for the pancreas dataset, with checkmarks highlighting statistical significance in comparison to the best configuration, Lec Adam CE. Here again we can observe that the HD and VS metrics mostly do not show significant differences, except for the case where the best and the worst configurations are compared to each other. The p-values of less than 0.05 for the rest of the metrics in most cases signify that the choice of parameters does have a noticeable impact on the segmentation results for the pancreas database. Moreover, for the second-best configuration, He Rms CE, whose results are similar to those of the best one, the null hypothesis is not rejected for the majority of the metrics used.
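The paired t-test described above can be sketched with the standard library alone. This is an illustration of the test, not the analysis code used for Tables 5-7: the function names are hypothetical, the two-sided p-value is obtained by numerically integrating the Student t density with Simpson's rule, and the sample Dice values are invented for the example. In practice a library routine such as `scipy.stats.ttest_rel` would normally be used instead.

```python
import math


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two paired samples with at least two cases")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0:
        raise ValueError("all paired differences are identical")
    return mean / math.sqrt(var / n), n - 1


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coef) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3  # probability mass inside [-|t|, |t|]
    return max(0.0, 1 - central)


# Hypothetical per-case Dice scores of two configurations on the same test cases.
best = [0.90, 0.88, 0.93, 0.91, 0.89]
other = [0.85, 0.86, 0.88, 0.87, 0.86]
t_value, dof = paired_t_statistic(best, other)
p_value = two_sided_p_value(t_value, dof)
significant = p_value <= 0.05  # corresponds to a checkmark in the tables
```

The pairing matters: both configurations are evaluated on the same test cases, so the test operates on the per-case differences rather than on the two group means.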
Finally, we have performed paired t-tests on the values in Table 4 to compare the configurations varying in initialization and activation functions. The comparison was performed between each configuration and the best one, i.e., Glo tanh. Table 7 shows the p-values for all the metrics. From the table, we can clearly see that for all the metrics the p-values are much lower than the significance level of 0.05. The null hypothesis is therefore rejected, and there are significant differences between the best configuration and the rest in terms of segmentation results.

The configurations used in the two experiments differ from each other: we employed 3D-UNet for multi-dataset segmentation and a VGG-16-based segmentation model for a single-dataset study. According to the literature [11][12][13][14], the most promising successor of FCN is 3D-UNet. As it is widely used for multi-class segmentation [52,53] and often referred to as a universal segmentation model, we decided to employ 3D-UNet for multi-dataset segmentation. We used different combinations of weight initialization, loss functions and optimizers while keeping ReLU as the activation function, for two main reasons. According to the literature, ReLU performs well with the U-Net architecture [54,55] and is reported to be six times faster than the sigmoid/tanh activation functions. As we were handling multiple datasets, we needed to reduce the computational cost, and hence kept ReLU as the activation function for all the experiments carried out with 3D-UNet. For the single-dataset study on the VGG-16-based segmentation model, we combined activation functions with weight initialization while keeping the loss function constant (CE).

Although VGG-16 is mostly regarded as a classification network, here we used it to develop a solution for liver segmentation together with a study of parameter choices. "Training algorithms for deep learning models are usually iterative in nature and thus require the user to specify some initial point from which to begin the iterations. Moreover, training deep models is a sufficiently difficult task that most algorithms are strongly affected by the choice of initialization" [56]. Following this statement, our motivation was to identify an initialization scheme that works well with the activation function. Based on the quality metrics, we conclude that the tanh activation function can be a good alternative to sigmoid and works well with Glorot/Xavier weight initialization [55].
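The initialization schemes compared in this study differ mainly in how the spread of the initial weights scales with the layer width. The sketch below uses the commonly cited normal-distribution variants: Glorot/Xavier with standard deviation sqrt(2 / (fan_in + fan_out)), LeCun with sqrt(1 / fan_in) and He with sqrt(2 / fan_in). Whether the experiments used the normal or the uniform variants is not stated here, so this choice, the function names and the layer sizes are assumptions made for illustration.

```python
import math
import random


def init_std(scheme, fan_in, fan_out):
    """Standard deviation of the zero-mean normal used by each scheme."""
    if scheme == "glorot":   # also known as Xavier
        return math.sqrt(2.0 / (fan_in + fan_out))
    if scheme == "lecun":
        return math.sqrt(1.0 / fan_in)
    if scheme == "he":
        return math.sqrt(2.0 / fan_in)
    raise ValueError(f"unknown scheme: {scheme}")


def sample_weights(scheme, fan_in, fan_out, seed=0):
    """Draw a fan_out x fan_in weight matrix as nested lists."""
    rng = random.Random(seed)
    std = init_std(scheme, fan_in, fan_out)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


# Hypothetical layer with 64 inputs and 128 outputs.
for name in ("glorot", "lecun", "he"):
    print(name, round(init_std(name, 64, 128), 4))
```

Glorot balances the fan-in and fan-out of a layer, which is the setting usually associated with symmetric activations such as tanh, while He initialization uses a larger variance that is usually associated with ReLU.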

Conclusions
In this research work, Chainer implementations of the 3D-UNet and VGG-16 networks were applied to segmentation tasks on medical image datasets. The main observations from the experiments suggest that there are interactions between the architectural parameters that enhance the output scores. The work has concentrated on initialization schemes rather than on the well-known hyper-parameter search techniques. We followed two different approaches, a multi-dataset segmentation study and a single-dataset segmentation evaluation. From both experiments, we propose a few combinations that may improve segmentation results, although it is hard to draw a concise conclusion from the values of the quality metrics. To the best of our knowledge, no single best algorithm has been proposed for generalised medical image segmentation, but studies show that there are good choices that can be made while designing the architecture [22][23][24]. From our experimental results, we have observed that two combinations, Xavier weight initialization (also known as Glorot) with the Adam optimiser and cross-entropy loss (Glo Adam CE) and LeCun weight initialization with the Adam optimiser and cross-entropy loss (Lec Adam CE), worked best for most of the metrics in the 3D-UNet setting, while Xavier initialization together with cross-entropy loss and the tanh activation function (Glo tanh CE) worked best for the VGG-16 network. The quantitative and qualitative analysis performed during the development of this work (see Tables 1-7 and Figures 5 and 6) shows the importance of the proposed combinations for the future development and design of network architectures. We believe that this research can provide a new perspective for related research in the medical domain, help fellow researchers to choose appropriate combinations for their network structure, and make them aware of the possible challenges and solutions.
As an extension to this work, we would like to experiment with the combinations with better results on a multi-domain network that works for segmentation analysis in data from two different domains, for example, CT/MR images.