Kidney and Renal Tumor Segmentation Using a Hybrid V-Net-Based Model

Abstract: Kidney tumors represent a type of cancer that people of advanced age are more likely to develop. For this reason, it is important to exercise caution and provide diagnostic tests in the later stages of life. Medical imaging and deep learning methods are becoming increasingly attractive in this sense. Developing deep learning models to help physicians identify tumors with successful segmentation is of great importance. However, not many successful systems exist for soft tissue organs, such as the kidneys and the prostate, whose segmentation is relatively difficult. In such cases where segmentation is difficult, V-Net-based models are mostly used. This paper proposes a new hybrid model using the superior features of existing V-Net models. The model represents a more successful system with improvements in the encoder and decoder phases not previously applied. We believe that this new hybrid V-Net model could help the majority of physicians, particularly those focused on kidney and kidney tumor segmentation. The proposed model showed better performance in segmentation than existing imaging models and can be easily integrated into all systems due to its flexible structure and applicability. The hybrid V-Net model exhibited average Dice coefficients of 97.7% and 86.5% for kidney and tumor segmentation, respectively, and, therefore, could be used as a reliable method for soft tissue organ segmentation.


Introduction
Developing countries are home to the most diverse cancer types, which can be explained by social and economic factors. However, lifestyle also has a great effect on these statistics [1]. Cancer statistics data reveal that more than 400,000 cases of kidney cancer were detected worldwide in 2018. Most people diagnosed with kidney cancer were between the ages of 60 and 70, with the statistics also indicating that the number of asymptomatic kidney tumors is increasing [2]. Smoking, obesity, and hypertension are among the determining risk factors for kidney cancer [3].
Kidney tumors can be divided into two distinct groups, namely, benign and malignant. Benign tumors are mostly harmless, but some may cause symptoms such as muscle pain or hematuria as the mass grows [4,5]. Malignant tumors are considered risky. The majority of these tumors are renal cell carcinomas (RCC) [6]. While kidney or tumor removal was an effective treatment method used in previous years, preventive treatment is gaining more importance thanks to the advanced imaging techniques available today [7]. Oncological treatments are not ignored, while promising studies focused on the prevention of unnecessary surgeries are also attracting attention [8]. In recent

Related Works
Xin Yang et al. [17] proposed a method for kidney segmentation that maintained robust segmentation accuracy across a wide variety of Dynamic Contrast Enhanced-Magnetic Resonance Imaging (DCE-MRI) data, stressing that very few manual operations and parameter settings were required for this approach. A five-step correction procedure was applied, with the authors reporting that the model was superior to other models, with an accurate segmentation rate of 95%.
Dehui Xiang et al. [18] proposed a method for automatic renal cortex segmentation, presenting an approach for the fully automatic identification of kidney and cortex tissues from CT scans. The method was tested on a dataset consisting of 58 CTs. Experimental results were found to be 97.86% ± 2.41% and 97.48% ± 3.18% for kidney and renal cortex segmentation, respectively.
Seda Arslan Tuncer and Ahmet Alkan [19] proposed a decision support system for the detection of renal cell cancer, the most common type of kidney cancer. They reported that the rapid spread of renal cell cancer and failure of early diagnosis often led to death. A machine learning-based decision support system was proposed to distinguish between healthy kidney cells and kidney cells with renal carcinoma, achieving a Dice coefficient of 89.3% in their study, conducted using 130 datasets obtained from Fırat University.
Guanyu Yang et al. [20] proposed a three-dimensional Fully Convolutional Neural Network (FCNN) model for the automatic segmentation of kidney and renal tumors. They stated that renal cancer is one of the ten most common types of cancer and emphasized that the prerequisite for surgical planning was accurate renal and tumor segmentation on CT images, adding that this is still a problem in automatic imaging. A new fully convolutional network (FCN) model combining a three-dimensional (3D) pyramid pooling module (PPM) and a gradually enhanced feature module (GEFM) was proposed. The proposed network architecture was an end-to-end learning system using 3D volumetric images, whereby a structure with 3D information was used to improve the segmentation of both the tumor lesion and the kidney. In experiments on 140 patients, the target structures were shown to be successfully segmented. The average Dice coefficients obtained for kidney and renal tumors were 0.931 and 0.802, respectively.
Florent Marie et al. [21] proposed an approach to segment deformed kidneys using CNNs. In a medical context, segmentation provides surgeons with a lot of information but is rarely performed. These researchers focused on kidneys deformed by nephroblastoma, proposing a new CNN assessment after different training sets for manual segmentation. An Overlearning Vector for Valid Sparse Segmentations (OV2ASSION) was used to train the CNN. The study achieved a Dice coefficient of 89.7%.
Couteaux et al. [22] developed a 2D U-Net model based on computed tomography images. Segmentation of the kidney cortex was performed using the current U-Net models, with the authors reporting that the segmentation results of their algorithm matched the renal cortex with good precision, reaching a Dice score of 0.867, ranking them first in the data challenge. However, they emphasized that it would be more accurate to apply the process in 3D by measuring the renal cortex volume, thereby requiring labeling effort to train deep networks.
Antoniya et al. [23] recently performed a study on renal cyst segmentation using CT images. They reported making several innovations in the CT images to optimize renal cyst diagnosis using a new hybrid segmentation approach. The segmentation was based on several basic techniques, with the study based on the idea that an optimized prepropagation algorithm is the core of kidney segmentation in CT images. Color-based, k-means clustering algorithms were used, achieving a success rate of 92.12% for kidney segmentation and 91.24% for cyst segmentation.
Rundo et al. [24] developed a U-Net-based model for prostate segmentation, stating that prostate cancer is very common and its diagnosis with MRI is difficult. They proposed a novel CNN, called USE-Net, incorporating Squeeze-and-Excitation (SE) blocks into U-Net, where the SE blocks were added after every Encoder block (Enc USE-Net) or after every Encoder and Decoder block (Enc-Dec USE-Net). SE blocks can be defined as block structures formed by a series of operations of "residual + global pooling + sigmoid" functions. This model was compared with the classical U-Net models. The Enc-Dec USE-Net model showed higher performance and achieved a better Dice coefficient than the Enc USE-Net and U-Net models. However, the effectiveness of SE blocks at certain stages remains open for discussion, and their contribution to the system should be further examined in terms of running speed. Nevertheless, this model provides important clues regarding the development of new architectures in the future.
Fuzhe et al. [25] proposed a study using artificial neural networks. They tried to both reduce data size and increase the success of existing algorithms in various. The Heterogeneous Modified Artificial Neural Network (HMANN) was used for the early detection and segmentation of chronic kidney disease. These authors aimed to segment the region of interest of the kidneys in the ultrasound image and reported that the proposed HMANN method achieved 97.5% classification success and significantly reduced processing time.
Luana Batista da Cruz et al. [26] proposed an automatic method to delimit the kidneys in CT images using image processing techniques and deep CNNs to minimize false positives. They mentioned that the precise segmentation of kidneys and kidney tumors could help physicians to diagnose diseases and improve treatment. Manual segmentation of the kidneys was stated to be difficult, therefore presenting
Chen Li et al. [27] developed a deep learning-based segmentation network (ANU-Net) for medical image segmentation. They stated that an automated medical image segmentation model is required to help doctors diagnose and treat organ lesions. They also stated that medical segmentation is a challenging task due to the irregular shapes of the target organs. The proposed network model was stated to have a deeply supervised encoder-decoder architecture and a redesigned dense skip connection. ANU-Net builds the network structure from nested convolutional blocks, so that the extracted features can be combined with a selection. The ANU-Net model was evaluated on four types of medical image segmentation tasks and achieved a Dice similarity coefficient of 90.10%.
Nithya et al. [28] proposed a method for the detection and segmentation of kidney diseases using artificial neural networks. They emphasized that ultrasound imaging plays an important role in kidney stone detection and segmentation for surgery and treatment, adding that kidney stone segmentation in ultrasound images is often performed manually in clinical practice. Having eliminated noise in the input image, the authors classified it using artificial neural networks and finally segmented stones and tumors separately, with a success rate of 99.61%.
Wenshuai Zhao et al. [29] developed a 3D U-Net-based architecture for kidney and tumor segmentation. They reported that the segmentation was performed by the physicians by examining the CT images obtained during clinical analysis. They also argued that this process was difficult, and the system could fail in the case of lack of previous experience. The U-Net-based architecture was, therefore, developed to segment the kidneys, on the argument that a simpler architecture could be more successful than complex models. They tested this architecture, called MSS U-Net, in the KiTS19 challenge, finding kidney and tumor Dice coefficients of 0.969 and 0.805, respectively.
Isensee et al. [30] proposed nnU-Net, a deep learning framework condensing the current domain knowledge and autonomously making the key decisions required to transfer a basic architecture to different datasets and segmentation tasks. The nnU-Net surpassed most specialized deep learning pipelines without manual tuning. This model is based on the principle of making the system simpler and more orderly through a systematic approach, clearing away the complex structure of the system without adding a new network structure. The authors stated that the model might have deficiencies in situations that require high performance, since the focus is only on the Dice coefficient. Despite its state-of-the-art performance, some hyperparameters, such as loss functions, may need to be adjusted manually. For this reason, the nnU-Net model can be turned into a semiautomatic system through manual additions, so that the deficiencies of the model can be eliminated by external intervention, thereby making the network performance more successful.

Image Preprocessing
In this study, 210 datasets were prepared for use, which are open to public access and can be downloaded through the cancer imaging archive page [31]. Additional explanations on the preparation of the dataset as well as the ethics committees are available on the main web page of the KiTS19 dataset [32]. Manual segmentation can cause a number of errors in the subsequent monitoring of the kidney or tumor. Additionally, it is time consuming and could slow system performance [33]. Despite these negative effects, we used the KiTS19 dataset because of the scarcity of available datasets in the literature. We prepared the clinical features of the existing patients, the imaging data, and the renal and tumor borders using the manual segmentation method. Figure 1 shows an example dataset prepared by the manual segmentation method.
The imaging and ground-truth labels were provided in anonymized NIfTI (.nii) format [34]. We resized the CT images in the dataset to 16 × 256 × 256 and divided the pixel values by 255 to normalize them between 0 and 1. The model parameters were initialized randomly, and no transfer learning was used. Patches of 64 × 128 × 128 in size were randomly sampled from the resampled volumes for training. The dataset consisted of 210 patients in total, of which 190 were used for training and the remaining 20 for testing; the split was made randomly. The model was trained with the Adam optimizer at a learning rate of 0.001. The batch size was 3, and the total number of epochs was set to 100,000. Training this model took about five days on an NVIDIA Tesla V100 (32 GB, NVLink) graphics processing unit (GPU). We used the TensorFlow library for training.
Figure 2 shows a 3D volume rendering of the segmented regions (kidney and renal cancer in blue and purple) and also the 2D kidney and renal cancer images. In the image processing phase, the CT image is analyzed to determine the slice thickness, window width, and position information. The kidney and renal tumor regions are preserved unchanged. In addition, original pictures and masks of these regions are created.
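The normalization and random patch sampling described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' TensorFlow pipeline: the function names `normalize_intensity` and `sample_random_patch`, the synthetic volume, and its size are assumptions made only for illustration, and the sketch assumes the input intensities already lie in the range 0 to 255.

```python
import numpy as np


def normalize_intensity(volume, max_value=255.0):
    """Divide voxel values by max_value so inputs in [0, max_value] map to [0, 1]."""
    return volume.astype(np.float32) / max_value


def sample_random_patch(image, label, patch_shape, rng):
    """Crop the same randomly positioned patch from an image and its label volume."""
    if image.shape != label.shape:
        raise ValueError("image and label must have the same shape")
    window = []
    for dim, size in zip(image.shape, patch_shape):
        if size > dim:
            raise ValueError("patch is larger than the volume")
        # Upper bound is exclusive, so the start lies in [0, dim - size].
        start = int(rng.integers(0, dim - size + 1))
        window.append(slice(start, start + size))
    window = tuple(window)
    return image[window], label[window]


# Hypothetical synthetic volume standing in for one resampled CT case.
rng = np.random.default_rng(0)
ct_volume = rng.integers(0, 256, size=(80, 160, 160))
mask_volume = (ct_volume > 200).astype(np.uint8)  # placeholder mask, not real labels

image_patch, label_patch = sample_random_patch(
    normalize_intensity(ct_volume), mask_volume, (64, 128, 128), rng
)
```

Cropping the image and the label with the same window keeps the voxel-wise correspondence between the CT data and its ground-truth mask, which is what a segmentation loss relies on.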

V-Net Architecture
Figure 3 shows the network structure of the classic V-Net architecture. The network architecture consists of the encoding and decoding portions, as in the basic U-Net architecture. Therefore, it is a derivation of the U-Net architecture, except with a volumetric design, which is suitable for use in tissues where it is difficult to identify organs and tumors (such as prostate or kidney) on CT imaging [12].
The V-Net architecture has a convolutional structure that extracts features and reduces the resolution along the encoder path. Classical pooling methods sometimes ignore important details during the segmentation process, so V-Net uses convolutions for downsampling instead, whereby the size of the data presented as input is reduced and the receptive field of the features computed in the subsequent network layers is increased [13]. Each stage on the encoder side of the V-Net architecture computes twice as many features as the previous stage. The decoder section of the network aims to provide a two-channel volumetric segmentation; for this reason, the feature maps are expanded in order to gather the necessary information. After each stage in the encoder part of the network architecture, the size of the feature maps is reduced, and the reverse operation (deconvolution) is performed in the decoder section to increase their dimensions. The features extracted at each stage of the encoder phase are forwarded to the decoder phase. This is shown schematically in Figure 3 with horizontal connections [35]. Therefore, small details that would otherwise be lost in the encoder part can be collected, thereby increasing the estimated segmentation quality.

Fusion V-Net Architecture
Figure 4 shows the network structure of the fusion V-Net architecture. The aim is to detect more features of the same scene by using fewer modalities. Coding the basic information for the architectural structure provides a level of learning without large amounts of data, so the use of a small-scale dataset also enables successful results in terms of performance [36]. Based on this idea, the fusion V-Net model feeds multiple inputs to the network in the encoder part. Figure 3 shows the encoder part of this structure. At this point, there is no limit to the number of inputs that can be added. However, increasing the number of inputs unnecessarily can overload the network architecture; therefore, inputs should be added in a controlled manner and unnecessary duplication should be avoided. Figure 5 shows a simple late fusion architecture structure.

ET-Net Architecture
In the architecture shown in Figure 6, an edge guidance module (EGM) is used to determine edge displays and maintain local edge characteristics. A weighted aggregation module (WAM) is then used to collect the side-outputs from the decoding layers. In this way, Edge-Attention Guidance Network (ET-Net) architecture is created by combining two different network structures [37]. While "Conv" symbolizes the convolutional layer, "U", "C" and "+" mean upsampling, concatenation, and aggregation, respectively.
The main goal of the architecture is to transmit edge attention impressions to the upper layers to improve the output from the decoder phase. The first inputs for each encoder block pass through the feature extraction section, consisting of a (1 × 1)-(3 × 3)-(1 × 1) convolutional layer stack, and then the system is operated by gathering the shortcuts of the inputs to achieve the desired outputs.
A residual connection allows the architecture to produce class-specific features [38,39]. The decoding block uses a depth-wise convolution to enhance the low- and high-level features. Then, a 1 × 1 convolution layer is applied to combine the channels.


Hybrid V-Net Architecture
Figure 7 shows the proposed hybrid V-Net architecture consisting of the encoder and decoder blocks, as in the classic V-Net architecture. (For codes: https://github.com/turkfuat/KiTS19-Hybird-V-Net-Model). Combining two different V-Net models in the encoder and decoder phases, the hybrid architecture is also supported with a unique ResNet layer before the output layer.
The encoder block was created based on the fusion V-Net model, and the input parameters were set to be input1 and input2. The decoder block was designed based on the ET-Net architecture. In the encoder block, the inputs pass through the fusion V-Net model so that all features are captured during segmentation. The goal of the ET-Net model was to catch even the smallest edge features for segmentation. In the decoder phase, the layers were connected by using the edge extraction features of the ET-Net model, which were forwarded to the ResNet++ block, the architecture of which is shown in Figure 8 below.

This block can be thought of as two nested ResNet blocks. Unlike a normal ResNet model, this block connects the output layer with the preceding two layers. Thus, small residual blocks before the output can also be captured. Adding this layer to all blocks makes the network very slow, so adding it to the correct layer is extremely important.
The ResNet1 and ResNet2 structures are shown in Equations (1) and (2). These two blocks represent the classic ResNet architecture [40].
To clarify Equations (1) and (2), let the output layer be layer_n. In this case, the previous layer is represented as layer_ (n − 1). This situation is shown in Equations (3) and (4).
In Equation (5), we see that the ResNet++ architecture combines the two ResNet blocks. The ResNet1 architecture runs first, followed by the ResNet2 architecture.
The ResNet++ block is implemented only in the final stage of the decoding phase, while the ResNet block is implemented in all phases. A detailed demonstration of this hybrid V-Net architecture is given in Table 1.
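Because Equations (1) to (5) are not reproduced in this text, the following NumPy sketch shows only one possible reading of the ResNet++ description: two classic residual blocks applied in sequence, plus an additional skip connection that links the output to the layer two steps back. The exact form of that extra connection, the function names `residual_block` and `resnet_pp_block`, and the toy transforms standing in for convolutional layers are all assumptions; the authors' repository should be consulted for the actual definition.

```python
import numpy as np


def residual_block(x, transform):
    """Classic residual connection: transform(x) plus the identity shortcut."""
    return transform(x) + x


def resnet_pp_block(x, transform1, transform2):
    """Hypothetical ResNet++ block built from two nested residual blocks.

    ResNet1 runs first and produces layer_(n-1); ResNet2 then produces the
    output layer_n. The final addition of x is the assumed extra link that
    connects the output with the layer two steps back.
    """
    layer_prev = residual_block(x, transform1)         # ResNet1
    layer_out = residual_block(layer_prev, transform2)  # ResNet2
    return layer_out + x                                # assumed extra skip


# Toy transforms standing in for convolution + activation stages.
features = np.ones((2, 3), dtype=np.float32)
output = resnet_pp_block(features, lambda t: 0.5 * t, np.tanh)
```

With both transforms set to zero the block reduces to the sum of its shortcuts, which is a quick way to check that each skip path is wired as intended.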

Dice Similarity Coefficient
The Dice similarity coefficient (DSC) measures the spatial similarity or overlap between two segmentations [41]. It is commonly used as a metric to evaluate segmentation performance against the ground truth in medical images [42]. Figure 9 shows the DSC area chart. The DSC calculation is shown in Equation (8):
$$\mathrm{DSC}(S, R) = \frac{2\,|S \cap R|}{|S| + |R|} \tag{8}$$
where S represents the result of segmentation and R is the corresponding ground-truth label. DSC is designed for image segmentation and is an accepted method to compare binary segmentations of the same image. Generally, a comparison is made between manual segmentation and the results of automatic or semiautomatic segmentation methods [43].
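As a minimal sketch, the DSC of two binary masks can be computed as twice the size of their overlap divided by the sum of their sizes, which is the standard definition of the Dice coefficient. The function name `dice_coefficient`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def dice_coefficient(segmentation, reference):
    """Dice similarity coefficient between two binary masks of equal shape."""
    s = np.asarray(segmentation, dtype=bool)
    r = np.asarray(reference, dtype=bool)
    if s.shape != r.shape:
        raise ValueError("masks must have the same shape")
    total = int(s.sum()) + int(r.sum())
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    overlap = int(np.logical_and(s, r).sum())
    return 2.0 * overlap / total


# Toy 2D masks: each has 3 foreground pixels and they share 2 of them.
predicted = np.array([[1, 1, 0], [0, 1, 0]])
ground_truth = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_coefficient(predicted, ground_truth)  # 2 * 2 / (3 + 3)
```

For multi-class labels such as kidney and tumor, the same function can be applied once per class by comparing the masks `label == class_id`.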

Results and Discussion
Having run the V-Net, fusion V-Net, the ET-Net, and the hybrid V-Net models, results were obtained and are discussed below in detail. All four models were run with the same hyperparameters, though all had different network architectures. We computed the Dice coefficient values for the kidneys, taking into account the ground-truth values and tumor labels.
The results shown in this section were calculated based on the average of the five-fold cross-validation results obtained from the training dataset. Figure 10 demonstrates the five-fold cross-validation scheme. Each fold was run separately, and the average validation result was calculated. In this way, we aimed to obtain a higher validation sensitivity in the training phase.
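The five-fold scheme can be sketched as follows: the training cases are shuffled once, split into five disjoint folds, and each fold serves as the validation set exactly once while the other four are used for training. The helper name `k_fold_indices`, the fixed seed, and the use of 190 cases (the training set size stated earlier) are assumptions for illustration; the paper does not describe how the folds were drawn.

```python
import numpy as np


def k_fold_indices(n_cases, n_folds=5, seed=0):
    """Return a list of (train_indices, validation_indices) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_cases)
    folds = np.array_split(shuffled, n_folds)
    splits = []
    for i in range(n_folds):
        validation = folds[i]
        # All remaining folds together form the training set for this run.
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        splits.append((train, validation))
    return splits


splits = k_fold_indices(190, n_folds=5)

# Placeholder per-fold validation scores; in practice each value would come
# from training the model on `train` and evaluating it on `validation`.
fold_scores = [0.0 for _ in splits]
mean_score = float(np.mean(fold_scores))
```

Averaging the per-fold validation scores gives a single figure that does not depend on one particular train/validation partition.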
Figure 11 shows the kidney and tumor DSC graphs obtained during the training period. A wavy curve was observed during the early stages of the training, possibly because of the difficulty level of the segmentation. In the following steps, the DSC values gradually stabilized and reached the desired level. Validation Dice loss charts initially showed partial fluctuations but remained at low, reasonable levels thereafter.
Table 3 compares the results of other kidney and renal tumor segmentation studies in the literature with the results of this study. Although the datasets were different, our model performed well in terms of the Dice coefficient for kidney segmentation. The model was further compared with the model that placed first in the KiTS19 challenge. Although the training and test sets were not the same, our model seemed to be particularly successful in terms of the kidney tumor Dice coefficient. Currently, the 90 test cases of the KiTS19 challenge cannot be used because they are not publicly accessible.
Tuncer and Alkan [19] were able to perform kidney segmentation on 100 images with a Dice coefficient of 0.893 using the decision support method, which is a machine learning model.
Cuingnet et al. [44] performed 3D kidney segmentation using the random forest machine-learning algorithm. The average Dice coefficient value reached 0.97.
Zheng et al. [45] developed an architecture called CNN + MSL, with which they performed precise segmentation, with a Dice coefficient of 0.905.
Milletari et al. [12], who first implemented the V-Net model, improved on the U-Net model with the aim of succeeding particularly on organs that are challenging to segment. They ran the model on a total of 27 test sets and achieved a Dice coefficient of 0.856.
Chenglong et al. [46] developed a deep learning architecture based on Fully Convolutional Networks (FCN). They used the same dataset as Milletari et al. and were able to perform kidney segmentation with a Dice coefficient of 0.95.
Guanyu Yang et al. [20] developed a 3D-FCN-based deep learning architecture for the diagnosis of renal cancer. After testing this architecture on 140 patients, Dice coefficients of 0.931 and 0.802 were achieved for kidney and kidney tumor segmentation, respectively.
Price Jackson et al. [47] managed to segment the left and right kidneys with a CNN-based model, testing this architecture on 89 CT images and achieving Dice coefficients of 0.91 and 0.86 for the right and left kidneys, respectively.
Luana Batista et al. [26] developed a CNN-based model using the KiTS19 dataset and achieved a Dice coefficient of 0.963.
Wenshuai Zhao et al. [29] developed a U-Net-based model and tested it on the KiTS19 dataset, achieving Dice coefficients of 0.969 and 0.805 for kidney segmentation and tumor segmentation, respectively.
Isensee et al. [34] designed a U-Net-based model using the KiTS19 challenge dataset, securing first place in the challenge. They obtained Dice coefficients of 0.979 and 0.854 for kidney and tumor segmentation, respectively.
As for the hybrid V-Net model designed in this work, it achieved Dice coefficients of 0.977 and 0.865 on the KiTS19 dataset for kidney and tumor segmentation, respectively.
Figure 12 shows the original images and masks used for kidney segmentation in the V-Net models, together with the segmentation results. All V-Net models achieved high training and test success, so the results appeared very similar. However, a closer look revealed that the hybrid V-Net model was more successful than the existing V-Net models in detecting small details, suggesting that the improvements made in the hybrid V-Net model yielded positive results.
Figure 13 shows the original images and masks used for renal tumor segmentation in the V-Net models, together with the segmentation results. The figure indicates that each V-Net model achieved an acceptable success rate for kidney tumors. Compared with the manual segmentation results for tumor detection, the hybrid V-Net model reproduced the details successfully, in many cases even delineating them with sharp lines. The hybrid V-Net model produced more successful output in the encoder portion when integrated with the fusion V-Net model and in the decoder portion in combination with the ETV-Net model.
Figure 13. Original input CT images and kidney tumor segmentation output images.
Our hypothesis was supported by the fact that the hybrid V-Net model, developed for soft tissues where kidney and tumor segmentation are challenging, yielded more successful results than the other models. We therefore designed a model that produced better results by improving the existing V-Net models. Inspired by the fusion architecture, we split the encoder phase into two separate encoder phases and combined the layers in the decoder phase to capture edge features. As a result, we obtained a model with better performance. The Dice coefficient results we obtained indicate that, owing to the effect of the ResNet++ block on the output, the model can capture even small details. However, despite the improvements in the hybrid V-Net model, training took an average of five days. That said, considering the additional processing volume in this model (such as ResNet++), our model could be considered to run faster than existing models. Developing new models that shorten the training period is a possible direction for our future studies. We should also emphasize that fully automatic segmentation systems, such as Fuzzy C-Means clustering and iterative optimal threshold selection algorithms [33], can be more successful on existing datasets, considering the difficulties of manual segmentation, such as processing time and the detection of errors in the segmentation process.

Conclusions
In this study, we proposed a new hybrid V-Net model using the superior features of existing V-Net models. We ran four models, including the hybrid V-Net model, on the KiTS19 dataset and performed kidney and tumor segmentation separately. The results showed that the hybrid V-Net model yielded more successful results for kidney and renal tumor segmentation than the other V-Net models, with DSC values of 0.977 and 0.865, respectively.
This study showed that V-Net models successfully perform organ and tumor segmentation in computerized images and that more successful models can be developed from existing V-Net models by considering the encoding and decoding stages separately. More suitable models could be designed for multiple organ segmentation using medical images. This study could also serve as a guide for future hybrid models, since the ResNet++ architecture contributed positively to the success of this first implementation of the hybrid V-Net model. The ResNet++ architecture was applied only to the output layer, making it possible to capture small details in the segmentation. This is extremely important for model design because each component can only be successful when added to the appropriate blocks of the model. The results presented here suggest that more research regarding the hyperparameters of this model is pertinent.
Following this study, we aim to investigate the shortcomings of our hybrid V-Net model; by eliminating these, we plan to develop more practical systems for kidney or other organ segmentation in medical imaging. Future studies regarding deep learning designs, especially in the field of medical imaging, should not be based on systems with complex structures. On the contrary, research should be concentrated on areas where better results can be obtained with small improvements to existing models (such as changing hyperparameters), thereby removing unnecessary load and improving existing model effectiveness. Future studies should further focus on shortening the training period of deep learning models. It is necessary to simplify the systems to which the models could be applied and reduce complexity to develop more successful models that can be used in various fields.