A Residual-Learning-Based Multi-Scale Parallel-Convolutions-Assisted Efficient CAD System for Liver Tumor Detection

Abstract: Smart multimedia-based medical analytics and decision-making systems are of prime importance in the healthcare sector. Liver cancer is commonly reported to be the sixth most frequently diagnosed cancer, and early diagnosis is required to support treatment planning. Liver tumors have intensity levels and contrast similar to those of neighboring tissues. Irregular tumor shapes, which depend on the cancer stage and tumor type, are another major challenge. Generally, liver tumor segmentation comprises two steps: the first involves liver identification, and the second involves tumor segmentation. This research work performs tumor segmentation directly from a CT scan, which is a more difficult but important task. We propose an efficient algorithm that employs multi-scale parallel convolution blocks (MPCs) and Res blocks based on residual learning. The fundamental idea of using parallel convolutions of varying filter sizes in MPCs is to extract multi-scale features for different tumor sizes. Moreover, the residual connections and residual blocks help to extract rich features with a reduced number of parameters. The proposed work also requires no post-processing techniques to refine the segmentation. The proposed work was evaluated using the 3DIRCADb dataset and achieved a Dice score of 77.15% and an accuracy of 93%.


Introduction
Recent advancements in machine learning and computer communication help to improve smart healthcare systems, especially those using IoMT devices [1,2]. A multimedia-based medical diagnostic system is one of the primary necessities in the healthcare sector. These intelligently designed diagnostic systems [3] provide solutions that assist radiologists and physicians [4]. Liver cancer is listed as the sixth most widespread cancer across the globe. Hepatocellular carcinoma (HCC) is the primary cancer type, causing about 700,000 deaths every year [5]. The most commonly observed underlying cause of primary liver cancers is cirrhosis. Cirrhosis is usually caused by excessive alcohol consumption, hepatitis B and C viruses, and other liver diseases that arise because of weight gain.
Cirrhosis can be diagnosed with the help of imaging testing approaches, such as MRI, CT, or ultrasound. CT is a very well-known image testing technique, as it provides comprehensive cross-sectional abdominal images and is used for liver tumor segmentation [6].

The main contributions of this work are as follows:
• A complete end-to-end algorithm for segmenting tumors directly from CT scans with no post-processing step.
• In our algorithm, we utilized MPCs to extract multi-scale features of tumors of different sizes and shapes.
• The residual learning approach was also employed in our algorithm by using residual connections and residual blocks, which helped with extracting deep features.
The rest of this paper is organized as follows: Section 2 provides an extensive literature review, Section 3 discusses the details of our proposed method, Section 4 presents results and discussion, followed by the conclusion in Section 5.

Related Work
The literature presents a diverse range of techniques and methodologies for liver tumor detection and segmentation. Christ et al. [21] proposed a model for liver and lesion segmentation using cascaded deep neural networks and 3D conditional random fields. The two cascaded U-Net models were used to perform liver and tumor segmentation and the resultant outputs were passed on to 3D conditional random fields with a DSC of 0.943 for liver segmentation. Sun et al. [22] proposed a multichannel fully convolutional network (FCN) for contrast-enhanced multiphase CT scan images. A single channel of the FCN consists of eight convolutional layers, three subsampling layers, three deconvolution layers, and two feature fusion layers. The convolutional layers used varied kernel sizes and acquired features from the image by keeping spatial correlations. All fully convolutional network channels were passed through an independent training phase and achieved a volumetric overlap error (VOE) of 8.1 ± 4.5. Chlebus et al. [9] also proposed a modified U-Net model for liver tumor segmentation using short skip connections for parameter renewals and speed enhancement of the model. The output was subjected to post-processing using a shape-based method and achieved a DSC of 0.58 using the random forest technique.
Liu et al. [23] modified the existing research work of Christ et al. [21] and Chlebus et al. [24] with their proposed GIU-Net by combining U-Net with a graph cut algorithm. They increased the depth of structure and made skip connections from the pooling layers output, combining it with a graph-cut approach while achieving a DSC score of 0.9505. Later, Li et al. [25] contradicted Liu et al. [26] and a similar approach presented by other researchers. They primarily focused on FCN-8's structure during the segmentation phase. The proposed model in this research had four major max-pooling layers with two skip structures to merge the final two outputs of the max-pooling layer with the parallel upsampling layer. There have been two additional skip connections for the integration of residual outputs of the max-pooling layer with a parallel up-sampling layer. The expected accuracy of 0.994 was not achieved due to the noise present in the input. Budak et al. [27] also presented two cascaded encoder-decoder CNNs for liver tumor segmentation and used an EDCNN algorithm with two symmetric encoder and decoder parts. The two parts had ten convolutional layers with batch normalization and ReLu activation, followed by a max-pooling layer. In the next step, segmentation was performed using two cascaded deep neural networks, with one focused on the liver and the other on the tumor. The output of the former network was forwarded as the input to the latter network. The DSC values of 0.9522 and 0.634 were gained in the liver and tumor segmentations, respectively.
All the previous research works employed deep-learning-based architectures for the efficient segmentation of liver tumors. It is generally observed that tumor segmentation is performed after liver ROI extraction, which requires the model to be trained in two stages. In the first stage, the model is trained to segment the liver from the whole CT scan, and in the second stage, it is trained again to extract the tumor from the extracted liver ROI. Moreover, these works pass the output of their models to post-processing techniques to improve the segmentation accuracy. Considering these issues, we propose a complete end-to-end segmentation model that segments liver tumors directly from a CT scan and requires no post-processing techniques. In our proposed approach, there is no need to first segment the liver for tumor segmentation, and our method can assist radiologists in better treatment planning by providing early and accurate tumor detection.

Materials and Methods
The main flow of the proposed methodology is shown in Figure 1, which includes the dataset extraction, followed by the preprocessing of the CT scan.

Dataset Extraction
The choice of dataset is a basic experimentation step that impacts the overall system performance. In this research work, we used the 3DIRCADb dataset, also known as the 3D Image Reconstruction for Comparison of Algorithm Database [28]. It contains a total of 20 folders with tumor CT scans from multiple European hospitals. More specifically, the 3DIRCADb dataset comprises CT scans of 20 patients, with hepatic tumors present in 75% of cases. Patient images are provided in DICOM format, along with the corresponding label images and ROIs. The total number of CT slices in each 3D image varies from patient to patient. Some slices contain no tumor, and we also included those slices in our experimentation. The size of each 2D CT slice used to train the algorithm is 256 × 256 × 1. The details of the dataset are presented in Table 1.

Preprocessing
Generally, medical imaging datasets have a noisy texture that causes the ROI to fade out. The noise can include blotches, irregular spots, unwanted objects, and irrelevant organs. A medical imaging dataset therefore needs to be preprocessed to make it suitable for further experimentation, as raw data is noisy most of the time. It is very important to enhance ROIs by eliminating unwanted noise, and various researchers have proposed techniques for this purpose. Most commonly, contrast enhancement is used to improve the image quality. This is done by windowing the Hounsfield unit (HU) values to the range [−100, 400], which suppresses noisy blotches, irregular spots, irrelevant organs, and unwanted objects in the enhanced image. We applied this preprocessing step to the dataset to enhance the visibility of the ROI and achieve a better image quality. This preprocessing technique is also followed by other researchers [27,29]. Figure 2 shows some samples from the 3DIRCADb dataset before and after the enhancement operation was applied.
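As an illustration of the windowing step described above, the following minimal sketch clips a slice to the [−100, 400] HU window. The function name `window_hu` and the linear rescaling to [0, 1] are assumptions made for this example; the text only specifies the HU range.

```python
def window_hu(slice_hu, lo=-100.0, hi=400.0):
    """Clip HU values to [lo, hi], then rescale linearly to [0, 1].

    The rescaling to [0, 1] is an assumption for illustration only.
    """
    span = hi - lo
    return [[(min(max(v, lo), hi) - lo) / span for v in row] for row in slice_hu]


# Toy 1 x 5 "slice": air, window floor, soft tissue, window ceiling, dense bone.
toy_slice = [[-1000.0, -100.0, 150.0, 400.0, 3000.0]]
windowed = window_hu(toy_slice)
```

Values below the window collapse to 0 and values above it collapse to 1, so the available intensity range is spent on the soft-tissue band where the liver and tumors lie.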

Architecture
In this section, we explain our proposed novel segmentation architecture for efficient liver tumor segmentation. The proposed architecture mainly consists of a down-sampling path, a bottleneck path, and an up-sampling path. Each of these paths employs multi-scale parallel convolution blocks (MPCs) and Res blocks. The architecture of the proposed algorithm is shown in Figure 3.

Down-Sampling Layers
The down-sampling path starts by using the CT scan image of size 256 × 256 × 1 as the input to the multi-scale parallel convolution block (MPC), as shown in Figure 3, followed by the Res block and a max-pooling operation of size 2 × 2 to reduce the spatial dimensions of the given CT scan. This process is defined in Equation (1):
$$y^{i}_{k,w} = \max_{0 \le a,\, b < p} x^{i}_{k \cdot p + a,\; w \cdot p + b} \tag{1}$$
In Equation (1), $y^{i}_{k,w}$ is the neuron at position (k, w) in the ith output map of the down-sampling layer. It is assigned the maximum value over a p × p region of the ith input map $x^{i}$.
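The max-pooling operation described above can be sketched in a few lines. This is a minimal pure-Python illustration on a single feature map, not the authors' implementation; the name `max_pool2d` and the toy input are assumptions.

```python
def max_pool2d(feature_map, p=2):
    """Non-overlapping p x p max pooling on a 2D list (one feature map)."""
    rows, cols = len(feature_map) // p, len(feature_map[0]) // p
    return [
        [max(feature_map[k * p + a][w * p + b] for a in range(p) for b in range(p))
         for w in range(cols)]
        for k in range(rows)
    ]


toy_map = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
pooled = max_pool2d(toy_map)  # each output keeps the largest value of a 2 x 2 region

# Spatial size after each of the four 2 x 2 pooling steps, starting from 256.
sizes = [256 // (2 ** step) for step in range(5)]
```

Applying 2 × 2 pooling four times reduces a 256 × 256 input to 16 × 16, which matches the spatial size stated for the input of the bottleneck layer.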
Multi-Scale Parallel Convolution Blocks (MPCs): The MPC contains parallel convolutions with filter sizes of 1 × 1, 3 × 3, and 5 × 5, each followed by a ReLU [30] activation, defined as in Equation (2):
$$f(z) = \max(0, z) \tag{2}$$
The outputs of these parallel convolutions are added and given as the input to the Res block. The MPC is a very powerful unit that learns at different scales, as its parallel convolutions are capable of extracting features of different-sized tumors. An MPC is used after every max-pool operation, except for the first MPC, which receives the input image. The feature maps of the different convolutions are calculated using Equation (3):
$$(m * n)[x, y] = \sum_{j} \sum_{k} n[j, k]\, m[x - j,\, y - k] \tag{3}$$
In Equation (3), the input image is denoted by m, while the kernel or filter is represented by n. The row and column indexes of the resulting matrix are denoted by x and y. The architecture of an MPC is illustrated in Figure 4.
The 1 × 1 convolution is represented mathematically by Equation (4):
$$x_j = \sum_{i} w_{ij} * x_i + b_j \tag{4}$$
In Equation (4), $x_i$ denotes the ith input channel, and the output value $x_j$ denotes the jth output channel. The weight matrix between $x_i$ and $x_j$ is represented by $w_{ij}$, while the bias term is represented by $b_j$. This 1 × 1 convolution acts as a projection layer and decreases the number of filters or kernels at the end layer and increases them at the first layer. This approach is known as the projection shortcut utilized by [24] and can be defined as in Equation (5):
$$y = F(x, \{W_i\}) + x \tag{5}$$
The input and output layer vectors are represented by x and y. The residual mapping that is to be learned is represented by the term $F(x, \{W_i\})$. In our Res blocks, there are two layers, $F = W_2 \sigma(W_1 x)$, as shown in Figure 3 part 2, in which σ denotes the ReLU activation function. The F + x operation is therefore computed with the help of an addition and a shortcut connection.
The shortcut connection in Equation (5) does not introduce any extra parameters in the network. The addition between F and x in Equation (5) is only performed if their dimensions are equal. When the dimensions are unequal, the shortcut connection performs a linear projection, denoted by $W_s$, as given by Equation (6):
$$y = F(x, \{W_i\}) + W_s x \tag{6}$$
Our down-sampling path follows the same pattern four consecutive times, each followed by a max-pool and a dropout layer with a rate of 0.05 to prevent overfitting of the model. The output of the multi-scale parallel convolution block is also added to the output of the Res block. Moreover, the number of filters in the four convolution blocks is 16, 32, 64, and 128, respectively.
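To make the MPC and Res block descriptions above concrete, the following is a minimal single-channel sketch in pure Python. It is not the authors' implementation: the function names, the zero padding, and the toy kernels are assumptions, and the helper uses cross-correlation (no kernel flip), as is common in deep-learning libraries. The Res block here uses the identity shortcut, which applies when the input and output dimensions are equal.

```python
def conv2d_same(image, kernel):
    """2D cross-correlation with zero padding, so output size equals input size."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * image[rr][cc]
            out[r][c] = acc
    return out


def relu(fmap):
    """Element-wise ReLU, max(0, z)."""
    return [[max(0.0, v) for v in row] for row in fmap]


def add(a, b):
    """Element-wise sum of two feature maps of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mpc_block(image, k1, k3, k5):
    """Parallel 1x1, 3x3, and 5x5 convolutions, each with ReLU, then summed."""
    branches = [relu(conv2d_same(image, k)) for k in (k1, k3, k5)]
    out = branches[0]
    for branch in branches[1:]:
        out = add(out, branch)
    return out


def res_block(x, w1, w2):
    """Two-layer residual mapping F = W2 * relu(W1 * x), plus identity shortcut."""
    f = conv2d_same(relu(conv2d_same(x, w1)), w2)
    return add(f, x)


def identity_kernel(size):
    """Hypothetical toy kernel: 1 at the center, 0 elsewhere."""
    kern = [[0.0] * size for _ in range(size)]
    kern[size // 2][size // 2] = 1.0
    return kern


image = [[1.0] * 4 for _ in range(4)]
mpc_out = mpc_block(image, identity_kernel(1), identity_kernel(3), identity_kernel(5))
zero_kernel = [[0.0] * 3 for _ in range(3)]
res_out = res_block(mpc_out, identity_kernel(3), zero_kernel)
```

With identity kernels, each branch returns the input unchanged, so the summed MPC output is three times the input. With a zero second layer, the residual mapping F vanishes and the Res block passes its input through the shortcut unchanged, which is the behavior that makes residual blocks easy to optimize.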

Bottleneck Layer
The bottleneck layer of our proposed architecture consists of a multi-scale parallel convolution block (MPC) followed by Res blocks, as shown in Figure 3 part 1. The output of the last max-pooling layer in the down-sampling path is given as an input to the MPC. It contains filters of different sizes, whose outputs are added and given as the input to the Res block. The output of the bottleneck layers is given as the input to the transposed convolution layer, which is the first up-sampling layer. The size of the feature map, which is given as an input to the bottleneck layer, is 16 × 16 × 256, with the total number of filters set to 32.

Up-Sampling Layers
The up-sampling layers use transposed convolutions of size 3 × 3 with a stride of 2 × 2. The transposed convolution serves as a deconvolution layer and performs the up-sampling of images with learned weights, unlike a simple up-sampling operation, which only doubles the dimensions of the input image without any weights. Transposed convolutions are also known as fractionally strided convolutions. Suppose that a convolution with kernel w and a stride of one unit without padding is applied to an input, and that the input and output are unrolled into vectors from left to right. The convolution can then be represented by a sparse matrix C, whose non-zero elements are the kernel elements $w_{ij}$. The backward pass of the convolution is obtained from the transpose of C: the error is backpropagated by multiplying the loss by $C^T$. In other words, the kernel w defines a convolution whose forward and backward passes are computed by multiplying by C and $C^T$, respectively. Similarly, the forward and backward passes of the transposed convolution defined by kernel w are computed by multiplying by $C^T$ and $(C^T)^T = C$, respectively. The pattern of the up-sampling layers consists of transposed convolutions followed by skip connections, MPCs, and Res blocks. The number of filters in the four transposed convolution layers is 128, 64, 32, and 16, respectively. The main purpose of these up-sampling layers is to recover the size of the feature maps by adding spatial and contextual information to the segmentation image. We can transfer the contextual information from the down-sampling layers to the up-sampling layers with the help of skip connections. In the end, a convolution of size 1 × 1 followed by a sigmoid activation function is used to obtain the final segmented image of size 256 × 256 × 1.
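The matrix view of the transposed convolution can be illustrated with a small one-dimensional sketch. It assumes a stride of one and no padding, as in the explanation above (the network itself uses a stride of 2 × 2); the kernel values and helper names are hypothetical.

```python
def conv_matrix(kernel, n_in):
    """Matrix C of a 1D stride-1, unpadded convolution acting on an unrolled input."""
    k = len(kernel)
    n_out = n_in - k + 1
    return [[kernel[j - i] if 0 <= j - i < k else 0.0 for j in range(n_in)]
            for i in range(n_out)]


def transpose(matrix):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def matvec(matrix, vector):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


kernel = [1.0, 2.0, 3.0]                 # hypothetical kernel w
C = conv_matrix(kernel, n_in=4)          # maps 4 inputs to 2 outputs
down = matvec(C, [1.0, 0.0, 0.0, 1.0])   # forward pass of the convolution
up = matvec(transpose(C), [1.0, 1.0])    # forward pass of the transposed convolution
```

Multiplying by C shrinks a length-4 vector to length 2, while multiplying by the transpose of C maps a length-2 vector back to length 4. The transposed convolution therefore restores the shape, not the original values, and its weights are learned like those of any other layer.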

Skip Connections
Low-level information may be lost during the down-sampling of the image. The skip connections are used to recover this information and to let the up-sampling layers retrieve the low-level features. This is achieved via a concatenation operation between the outputs of the up-sampling layers and the outputs of the corresponding down-sampling layers, which combines the contextual information for localization. A dropout of the same rate is applied after the concatenation operation, followed by multi-scale parallel convolution blocks (MPCs) and Res blocks with shortcut connections, respectively.

Training Details and Hyperparameters
The hyperparameters of our proposed model include a learning rate of 0.001 with the adaptive moment estimation (Adam) optimizer. Adam combines the momentum term of stochastic gradient descent with the adaptive scaling of RMSprop. Adam updates the weights of the network using Equation (7):
$$W_{t+1} = W_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{7}$$
In Equation (7), the weights of the model are represented by W, η represents the step size, whose value depends upon the iterations of the network, and ε is a small constant that prevents division by zero. The values of $\hat{m}_t$ and $\hat{v}_t$ are computed using the equations below:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}$$
Here, $m_t$ and $v_t$ are the exponential moving averages of the gradient and the squared gradient, and the values of $\beta_1$ and $\beta_2$ are 0.9 and 0.999, respectively. During network training, the error between the actual and predicted values is computed using the binary cross-entropy loss, defined below:
$$\mathrm{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log P(y_i) + (1 - y_i) \log\left(1 - P(y_i)\right) \right]$$
In the above equation, BCE stands for binary cross-entropy, N is the number of pixels, $y_i$ refers to the actual class of pixel i, and $P(y_i)$ represents the probability predicted by the trained model that pixel i belongs to the foreground. The proposed model was trained for 150 epochs with a batch size of 4 and an image dimension of 256 × 256 × 1.
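The optimizer update and the loss described above can be sketched as follows. This is a minimal pure-Python illustration using the stated learning rate (0.001) and β values (0.9 and 0.999); the function names, the stabilizing constants `eps`, and the toy inputs are assumptions, not values taken from the paper.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a flat list of weights; t is the 1-based step count."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]   # bias-corrected first moment
    v_hat = [vi / (1 - beta2 ** t) for vi in v]   # bias-corrected second moment
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


def bce(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


weights, m, v = [1.0], [0.0], [0.0]
weights, m, v = adam_step(weights, [2.0], m, v, t=1)
loss = bce([1, 0], [0.5, 0.5])
```

On the first step, the bias correction makes the update magnitude roughly equal to the learning rate regardless of the gradient scale, so the toy weight moves from 1.0 to about 0.999. A maximally uncertain prediction of 0.5 for every pixel gives a loss of ln 2.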

Experimentation and Results
This section is divided into subheadings that provide a concise and precise description of the experimental results, their interpretation, and the experimental conclusions that can be drawn.

Dice Similarity Coefficient
DSC, or Dice similarity coefficient, is commonly used to calculate the similarity between two samples. In this research work, this performance measure determined the overlap between two binary masks. It can be mathematically defined as twice the size of the overlap between two segmentations divided by the total size of the two objects. DSC ranges from 0 (no overlap) to 1 (perfect overlap) and is calculated using the following equation:
$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN}$$

Jaccard Similarity Coefficient
JSC measures the agreement between the segmented image and the binary mask. It is also defined as a measure of the similarity and diversity of the samples used in experimentation. In mathematical terms, it is the ratio of the intersection of two binary masks to their union. JSC is calculated according to the equation given below:
$$\mathrm{JSC} = \frac{TP}{TP + FP + FN}$$

Accuracy
Accuracy is one of the most significant performance measures for determining the efficiency and effectiveness of a model. Accuracy represents the ratio of correctly segmented samples to the total number of samples [31]:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Symmetric Volume Difference
SVD provides the difference between the segmented image and the ground truth. An SVD of zero indicates that the segmentation matches the ground truth exactly. SVD is calculated using the equation below, where DSC is the Dice similarity coefficient:
$$\mathrm{SVD} = 1 - \mathrm{DSC}$$

Sensitivity
Sensitivity measures the proportion of actual positives that are correctly identified [32]:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

Specificity
Specificity measures the proportion of actual negatives that are correctly identified [32]:
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

Matthews Correlation Coefficient (MCC)
MCC is widely used for classification problems when the classes are highly imbalanced. It is also known as the "phi coefficient" and it is defined using Equation (13):
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{13}$$
In all the above equations, true positive (TP) represents pixels that belong to the foreground and are classified as foreground. True negative (TN) represents pixels that belong to the background and are classified as background. False negative (FN) represents foreground pixels that have been incorrectly classified as background pixels. False positive (FP) represents background pixels that have been incorrectly classified as foreground pixels.
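The evaluation measures of this section can all be computed from the four confusion counts. The sketch below is a minimal illustration using the standard definitions; the function names, the zero-denominator convention, and the toy masks are assumptions. SVD is taken here as 1 − DSC, an assumption that is consistent with the Dice and SVD scores reported in this paper.

```python
import math


def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for two flat binary masks (1 = tumor, 0 = background)."""
    tp = tn = fp = fn = 0
    for p, t in zip(pred, truth):
        if p == 1 and t == 1:
            tp += 1
        elif p == 0 and t == 0:
            tn += 1
        elif p == 1 and t == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def ratio(num, den):
    """Return num / den, or 0.0 for a zero denominator (a convention assumed here)."""
    return num / den if den else 0.0


def segmentation_metrics(tp, tn, fp, fn):
    """Dice, Jaccard, accuracy, SVD, sensitivity, specificity, and MCC."""
    dsc = ratio(2 * tp, 2 * tp + fp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "dsc": dsc,
        "jsc": ratio(tp, tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "svd": 1.0 - dsc,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Toy masks with 8 TP, 2 FP, 2 FN, and 88 TN pixels.
pred = [1] * 8 + [1] * 2 + [0] * 2 + [0] * 88
truth = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 88
metrics = segmentation_metrics(*confusion_counts(pred, truth))
```

For these toy masks, the Dice score is 0.8 while the accuracy is 0.96, because the 88 correctly classified background pixels count toward accuracy but not toward Dice.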

Results and Discussions
In this work, a novel deep learning algorithm was used to segment liver tumors directly from CT scans. The proposed model was validated on preprocessed images of the 3DIRCADb dataset. The whole dataset of 20 patients was divided into non-overlapping train and test sets with a random 80-20 split. Moreover, the same experiment was executed ten times and the average results are reported to avoid any bias. Table 2 shows the results of our proposed model for segmenting liver tumors. The Dice score achieved by our proposed model was 77.15% and the Jaccard score was 68.5%. The standard deviations are also given in Table 2. The other evaluation metrics, which include accuracy and SVD, were also calculated. The SVD shows the difference between the actual and predicted masks. The proposed model achieved an accuracy of 93% with an SVD score of 0.23, which indicates a small difference between the actual and predicted masks, as shown in Table 2. The reason for the higher accuracy was class imbalance. In a given CT scan image, most pixels belong to the background class, while the number of pixels where the tumor is present is much smaller. Therefore, the accuracy value is biased toward the background class because accuracy counts the TP, FP, TN, and FN of all classes. The Dice and Jaccard scores more accurately represent the capability of the segmentation model. Moreover, the values of sensitivity, specificity, and the MCC were 76.5%, 79.56%, and 0.77, respectively.
We also compared our model with the U-Net proposed by Ronneberger et al. [33]. The standard U-Net architecture is widely known for biomedical image segmentation and is extensively used by different researchers. The Dice and Jaccard scores achieved by U-Net on the 3DIRCADb dataset were 67.5% and 56%, which were considerably lower than those of our algorithm. The other scores achieved by U-Net, namely sensitivity, specificity, and the MCC, were 70.1%, 64.8%, and 0.69, respectively. The SVD score of the standard U-Net, i.e., the difference between the actual and predicted masks, was 0.33. Our model achieved an improvement of 9.65 percentage points in the Dice score and 12.5 percentage points in the Jaccard score.
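The effect of class imbalance on accuracy, noted above, can be illustrated with a deliberately useless prediction. The sketch below is a hypothetical example, not a result from the paper: it assumes a 256 × 256 slice with a 400-pixel tumor and a model that labels every pixel as background.

```python
def accuracy_and_dice(pred, truth):
    """Pixel accuracy and Dice score for two flat binary masks (1 = tumor)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    accuracy = (tp + tn) / len(truth)
    dice_den = 2 * tp + fp + fn
    dice = 2 * tp / dice_den if dice_den else 0.0
    return accuracy, dice


n_pixels = 256 * 256
tumor_pixels = 400  # hypothetical tumor size
truth = [1] * tumor_pixels + [0] * (n_pixels - tumor_pixels)
all_background = [0] * n_pixels  # a prediction that misses the tumor entirely
accuracy, dice = accuracy_and_dice(all_background, truth)
```

Even though the tumor is missed completely, the accuracy stays above 99% while the Dice score is 0, which is why the Dice and Jaccard scores are the more informative measures for this task.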
Moreover, during training, the input image of the CT scan is passed through different layers of the model, which includes convolution layers and pooling layers. The output of each layer takes the form of activation maps. The visualization of those activation feature maps of some intermediate layers of our proposed model is also shown in Figure 5. These visualizations show how the model depicts the contextual information of the image layer by layer.
Moreover, some sample images that were segmented by our model are shown in Figure 6. In Figure 6, column (A) shows the original test set slice images, column (B) shows the actual masks of the tumor, column (C) shows the overlay of the actual mask on the image, column (D) shows the masks predicted by the model, and column (E) shows the overlay of the predicted mask on the image. It is observed from Figure 6 that the model had difficulty segmenting very small tumors, as shown in the last row. Similarly, the model also had difficulty with the segmentation in the second row, as shown in columns (B) and (D) of Figure 6. All the scores were calculated from the predicted and actual masks, which are shown in columns (B) and (D) of Figure 6. Furthermore, the Dice and Jaccard scores of the individual CT scan slices in the test set are shown in Figure 7. In Figure 7, the x-axis shows the CT scan slice number, while the y-axis shows the Dice and Jaccard scores. It is observed from Figure 7 that, for most CT scan slices, the Dice score was above 80%.
Moreover, we also checked the model loss over the epochs during the training of our proposed algorithm. Usually, when the model loss approaches zero or becomes constant over a certain number of epochs, the training has converged. The loss curves of our proposed method are shown in Figure 8. Furthermore, the accuracy of the proposed model over the epochs during training is also shown in Figure 8. Accuracy measures the proportion of correct predictions over all classes. The loss and accuracy curves of the U-Net proposed by Ronneberger et al. [33] are also given in Figure 8.

Comparison with State-of-the-Art Approaches
The performance and results of our proposed approach are explained in the previous section. The use of an MPC gives multi-scale features, which were shown to be very beneficial for encoding information of different-sized tumors. To compare the performance of our proposed model with existing methods, a detailed comparative analysis was performed. It was found from the literature that, using the 3DIRCADb dataset, Christ et al. [29] achieved a 61% Dice score using two FCNs in a cascaded manner. Alirr et al. [34] achieved a Dice score of 75% by utilizing the traditional method of adaptive thresholding to extract masks of liver tumors. Li et al. [35] and Z. Bai et al. [36] achieved Dice scores of 65% and 76.5%, respectively, by making some improvements upon the standard U-Net. Z. Bai et al. [36] also used an active contour model (ACM) to refine the tumor segmentation. Similarly, Budak et al. [37] achieved a Dice score of 63.4%. Moreover, among the recent work on liver tumors, S.-T. Tran et al. [38] proposed an improved U-Net-based method that employs dense and dilated convolutions and achieved a very significant improvement in the Dice score. Similarly, H. Seo et al. [39] also proposed an improved U-Net for segmenting both the liver and tumors. Compared with the previous literature, the proposed model achieved a significant improvement in segmenting tumors directly from CT scans. The main reason behind the improvement was the utilization of MPCs and the concept of residual learning to obtain features without increasing the number of parameters in the network. The multi-scale features extracted using the MPC are added to the feature maps of the Res blocks to better describe the tumor features. Furthermore, the previous work in the literature developed two-stage algorithms that first segment the liver and then segment the liver tumor.
The previous methods also use post-processing techniques to refine the tumor segmentation. It is necessary to mention here that the proposed approach follows an end-to-end mechanism to segment tumors in an efficient manner. The comparison results of previous techniques, standard U-Net, and our proposed approach are given in Table 3 in terms of Dice and SVD scores.

Table 3. Comparison with the state-of-the-art approaches (columns: Authors, Dice Score).

Conclusions
This research work highlighted problems in liver tumor segmentation and provided a solution to address those issues. Many researchers have previously presented two-step methods that carry out liver segmentation, followed by tumor segmentation. This is a time-consuming approach with a higher chance of inaccuracy. To solve these issues, we proposed a technique that directly segments the tumor from a CT scan, which is a challenging task. Nevertheless, the proposed end-to-end segmentation algorithm performs efficiently on the sample data and provides accurate results. The proposed work employs MPCs to encode multi-scale features of different tumor sizes. The incorporation of Res blocks is also helpful for encoding tumor features with a reduced set of parameters in the network. All of these characteristics increase the segmentation performance of our model. Moreover, our approach does not require post-processing steps for the refinement of segmentation results. The proposed system was evaluated using the publicly available 3DIRCADb dataset and achieved a Dice score of 77.15%, which compares favorably with existing published work. To ensure the validity of the proposed framework, we performed a comparative analysis with existing techniques for liver tumor detection. In the future, we will apply attention gates to our model to further improve the performance.