Data Balancing Based on Pre-Training Strategy for Liver Segmentation from CT Scans

Data imbalance is often encountered in deep learning and is harmful to model training. An imbalance between hard and easy samples in the training dataset often occurs in segmentation tasks on Computed Tomography (CT) scans. However, because adjacent slices in a volume are strongly similar, and because sample difficulty depends on the segmentation task (the same slice may be a hard sample in the liver segmentation task but an easy sample in the kidney or spleen segmentation task), it is hard to solve this imbalance using traditional methods. In this work, we use a pre-training strategy to distinguish hard and easy samples, and then increase the proportion of hard slices in the training dataset, which mitigates the imbalance between hard and easy samples and enhances the contribution of hard samples to the training process. Our experiments on liver, kidney and spleen segmentation show that increasing the ratio of hard samples in the training dataset could enhance the prediction ability of the model by improving its ability to deal with hard samples. The main contribution of this work is the application of the pre-training strategy, which enables us to select training samples online according to the task and to ease data imbalance in the training dataset.


Introduction
Accurate segmentation of the liver can greatly help the subsequent segmentation of liver tumors, and it assists doctors in assessing disease conditions and planning treatment for patients [1]. Traditionally, liver delineation relies on slice-by-slice manual segmentation of Computed Tomography (CT) or Magnetic Resonance Imaging (MRI) scans by radiologists, which is time-consuming and prone to intra- and inter-observer variation. With the rapid increase of CT and MRI data, traditional manual segmentation has become increasingly unable to meet clinical needs. Therefore, automatic segmentation tools are required for practical clinical applications.
Methods such as region growing, intensity thresholding, and deformable models have achieved automatic or semi-automatic segmentation to a certain extent, with good segmentation results. However, these methods rely on hand-crafted features and have limited feature extraction ability. Recently, deep learning methods, especially fully convolutional networks (FCNs), have achieved great success on a broad array of recognition problems [2][3][4]. Many researchers have advanced this line of work by applying deep learning to segmentation tasks such as liver [1,5-7], kidney [8], vessel [9][10][11] and pancreas [12][13][14]. All the models mentioned above depend on a large amount of data. However, there are often two kinds of data imbalance in the training process for the segmentation of CT scans: (i) data imbalance within images: the imbalance between background voxels and target voxels, as shown in Figure 1; (ii) data imbalance between images: the imbalance between hard and easy examples in the training dataset (easily segmented slices are called easy samples or easy slices, while slices that are difficult to segment are called hard samples or hard slices). As shown in Figure 2A,B, the features of some slices are obvious and easy to segment. However, in some others, as shown in Figure 2C,D, the features of the liver are not obvious, which may be due to poor CT image quality or to the liver itself (e.g., liver morphological variation, liver lesions, etc.), and it is difficult to accurately segment the liver from these slices. Moreover, it is easy to qualitatively divide samples into hard and easy ones according to the segmentation results, but it is difficult or almost impossible to describe the characteristics of hard and easy samples and accurately distinguish them in the training dataset before the training process.
Appl. Sci. 2019, 9, x FOR PEER REVIEW 2 of 9
Using the Dice coefficient [15] as the loss function in the training process can solve the first kind of data imbalance by reducing or even ignoring the contribution of background voxels. However, due to the similarity between adjacent slices in medical images, and because sample difficulty differs across training tasks (for example, the same slice may be a hard example in the liver segmentation task but an easy example in the kidney or spleen segmentation task), it is difficult to classify medical images in the training dataset automatically using traditional methods before the training process. When there are many easy samples, the contribution of hard slices will be overwhelmed in the training process, which could cause a significant reduction in the prediction ability of the model for difficult samples, and may even lead to overfitting. Therefore, it is necessary to classify the training samples and increase the proportion of hard samples in the training dataset.
Recently, focal loss, which automatically down-weights easy negative samples during training and rapidly focuses on hard examples in every training batch, has achieved great success in one-stage object detectors [16]. However, focal loss does not change the imbalance between hard samples and easy samples in the training dataset, so the contribution of hard slices may still be overwhelmed in the training process. In order to solve or alleviate this imbalance problem, we introduce an online hard example enhancement method to increase the proportion of hard samples in the training dataset. First, we use part of the slices in the whole training dataset to train a pre-training model according to the needs of the segmentation task, and then the pre-training model is used to distinguish hard samples from easy samples in the remaining slices of the training dataset. Second, the hard slices identified by the pre-training model are selected and enhanced by flipping, and then these slices are added to the dataset used in the pre-training process to raise the ratio of hard slices in the training dataset and improve the contribution of hard slices to the training process. Therefore, the basic purpose of the pre-training strategy is to obtain a sample classifier that can distinguish hard and easy slices according to the needs of the actual task.
To demonstrate the effectiveness of the proposed method, we adopt a classical 2D FCN model based on VGG-16 [17] and a 2D U-Net [3], as shown in Figures A1 and A2, respectively, for liver, kidney and spleen segmentation from Computed Tomography (CT) scans.

Dataset and Processing
We test our method on datasets acquired from different scanners at different medical institutions. The collected dataset consists of 260 CT scans, with slice spacing varying widely from 0.45 mm to 5 mm. Of these, 220 CT scans were randomly selected for training, and the remaining 40 were used for testing. For image pre-processing, the image intensity values were truncated to the range of [−150, 250] Hounsfield units (HU) to remove irrelevant details [9].
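The intensity truncation described above can be sketched as follows. This is a minimal illustration using NumPy, not the authors' code; the function name `truncate_hu` and the optional rescaling to [0, 1] are assumptions that the text does not specify.

```python
import numpy as np

def truncate_hu(volume, hu_min=-150.0, hu_max=250.0, rescale=False):
    """Truncate CT intensities to the window [hu_min, hu_max] in Hounsfield units.

    `rescale` is a hypothetical option (not stated in the text) that maps the
    truncated values linearly to [0, 1].
    """
    clipped = np.clip(np.asarray(volume, dtype=np.float32), hu_min, hu_max)
    if rescale:
        clipped = (clipped - hu_min) / (hu_max - hu_min)
    return clipped

# Toy example: air, window edges, soft tissue, and dense bone.
slice_hu = np.array([-1000.0, -150.0, 50.0, 250.0, 3000.0])
windowed = truncate_hu(slice_hu)
```

Values outside the window are set to the nearest bound, so air and dense bone no longer dominate the intensity range seen by the network.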

Evaluation Metrics
The Dice coefficient, which measures the degree of agreement between two image regions, was used to evaluate the segmentation performance on the test dataset.
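For two binary masks A and B, the Dice coefficient is commonly computed as 2|A ∩ B| / (|A| + |B|). The sketch below illustrates that standard definition with NumPy; returning 1.0 when both masks are empty is a convention assumed here, since the text does not say how empty slices are scored.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice overlap between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denominator = pred.sum() + target.sum()
    if denominator == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / denominator)

# One overlapping pixel, two predicted pixels, one true pixel: 2 * 1 / (2 + 1).
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```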

Implementation Details
The classical 2D FCN and 2D U-Net models are implemented for the CT segmentation tasks using the TensorFlow package [9]. We use stochastic gradient descent (SGD) with a mini-batch size of 16.
Inspired by [1], we use the "poly" learning rate policy, in which the current learning rate equals the initial learning rate multiplied by $(1 - \text{iterations}/\text{total\_iterations})^{\text{power}}$. We set the initial learning rate to 0.001 and the power to 0.9, and the models are trained for up to 10 × 10^5 iterations. We use the Dice coefficient as the loss function in the training process. For data augmentation, we adopt random mirroring and flipping for all datasets. We use the aforementioned training strategy in both the pre-training process and the final training stage.
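The "poly" policy can be written as a small function. The sketch below uses the initial learning rate (0.001), power (0.9), and iteration budget (10 × 10^5) given in the text; the function name and the range check are illustrative assumptions, and in practice the schedule would be wired into the optimizer of the training framework.

```python
def poly_learning_rate(iteration, total_iterations=1_000_000,
                       initial_lr=0.001, power=0.9):
    """Learning rate under the "poly" policy:
    initial_lr * (1 - iteration / total_iterations) ** power
    """
    if not 0 <= iteration <= total_iterations:
        raise ValueError("iteration must lie in [0, total_iterations]")
    return initial_lr * (1.0 - iteration / total_iterations) ** power

# Learning rate at the start, halfway through, and at the end of training.
schedule = [poly_learning_rate(i) for i in (0, 500_000, 1_000_000)]
```

The rate starts at the initial value and decays smoothly to zero at the final iteration.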

Results
Because of the strong similarity between adjacent slices in CT scans, we assume that the contribution of some slices could be replaced by others in the training process. To test this idea, we select a subset of cases at a certain ratio from the whole training dataset, based on simple statistical information (e.g., the number of slices in a volume and the proportion of positive and negative samples in a volume). The selected cases were then enhanced by flipping and mirroring.
As shown in Table 1, reducing the number of training samples within a certain range has little influence on the segmentation ability of FCN. However, the prediction ability of FCN decreases significantly when the number of selected training samples is further reduced. The max value, which refers to the best segmentation result of the model, changes little across the selection ratio experiments (the selection ratio is the ratio of training scans in part B to the total number of scans in the training dataset). Meanwhile, the min value, which refers to the worst segmentation result of the model, decreases significantly when the training samples decrease substantially. These phenomena are also observed in kidney segmentation and spleen segmentation from CT scans using the FCN model, as shown in Tables A1 and A2. Moreover, the same results were found in the liver, kidney and spleen segmentation tasks using the U-Net model, as shown in Tables A3-A5. These results suggest that there is redundancy in the training dataset, and that too little training data is harmful to model training.
Because the performance of the FCN model begins to decline significantly when the selection ratio is less than 0.5, we set the selection ratio to 0.5 in the proposed model and divide the training dataset into two parts (A and B) for liver segmentation. Slices in part B are used for model training, and we obtain the pre-training model after 5 × 10^5 iterations. Using the pre-training model to predict slices in part A, we then simply classify these slices into two categories, i.e., hard and easy samples, based on their Dice score. In the liver segmentation task using the FCN model, we set the threshold to 0.923, the min Dice score of the baseline. In total, 6268 slices are classified as hard samples, whereas 35,984 slices are classified as easy samples, almost 6-fold the number of hard samples. Hard samples in part A were enhanced by flipping and added to the dataset (part B) used in the pre-training process. Then, we continue the training process until the model reaches 10 × 10^5 iterations.
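The thresholding and enhancement step can be sketched as follows, assuming the per-slice Dice scores from the pre-training model are already available. The function names are hypothetical. Treating a score strictly below the threshold as hard, flipping along the last (width) axis, and keeping the original slices alongside their flipped copies are all assumptions, since the text does not fix these details.

```python
import numpy as np

def split_hard_easy(dice_scores, threshold=0.923):
    """Return (hard_indices, easy_indices) for slices scored by the pre-training model.

    A slice is treated as hard when its Dice score is below the threshold
    (assumed; the text does not say how ties are handled).
    """
    scores = np.asarray(dice_scores, dtype=float)
    hard = np.flatnonzero(scores < threshold)
    easy = np.flatnonzero(scores >= threshold)
    return hard, easy

def enhance_by_flipping(images, masks):
    """Return the input slices followed by their horizontally flipped copies.

    `images` and `masks` are arrays of shape (num_slices, height, width).
    """
    flipped_images = images[:, :, ::-1]
    flipped_masks = masks[:, :, ::-1]
    return (np.concatenate([images, flipped_images], axis=0),
            np.concatenate([masks, flipped_masks], axis=0))

# Toy example with four slices from part A.
scores = [0.95, 0.50, 0.923, 0.90]
hard_idx, easy_idx = split_hard_easy(scores)
part_a_images = np.arange(4 * 2 * 3, dtype=np.float32).reshape(4, 2, 3)
part_a_masks = (part_a_images > 10).astype(np.uint8)
extra_images, extra_masks = enhance_by_flipping(part_a_images[hard_idx],
                                                part_a_masks[hard_idx])
```

The arrays `extra_images` and `extra_masks` would then be appended to part B before training continues.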
As shown in Table 1, the proposed model performs slightly better than the baseline in liver segmentation with a smaller training dataset. Moreover, adding hard examples has almost no effect on the max value of the Dice score, but it can significantly increase the min value compared with the baseline. This indicates that increasing the ratio of hard samples in the training dataset has little influence on easily segmented cases, but could greatly improve the segmentation ability of the model on hard samples.
The segmentation results displayed in 3D form in Figure 3A,B show that the proposed method could enhance liver segmentation results, especially in some details. Liver segmentation results on hard examples are greatly improved compared with the baseline, as shown in Figure 3C,D, which may be attributed to the increased number of hard samples in the training dataset. The above results suggest that raising the proportion of hard samples in the training dataset could improve the prediction performance of the FCN model in the liver segmentation task, as well as the model's ability to deal with hard samples.

Discussion
It is often thought that the more data, the better the performance in deep learning. However, in this work, we observed that a proper reduction of training samples had little effect on the segmentation performance of the model. This may be due to the strong similarity between adjacent slices in CT images, which makes it difficult to ensure that each image in the training dataset is independent of the others; in other words, the contribution of some samples can be replaced by others in the training process. However, it is hard to screen out which samples are redundant. The relatively shallow network structure, which has relatively weak deep feature extraction capability, may be another reason for the phenomenon observed in this work. Meanwhile, the significantly reduced performance of the model after a large reduction of the training dataset also supports the point that more data gives better performance in deep learning.
Additionally, the same slices may play different roles in different segmentation tasks. For example, the positive-hard samples in the liver segmentation task may be negative-easy ones in kidney or spleen segmentation. Therefore, it is difficult to classify samples with traditional unsupervised methods. Inspired by the pre-training strategy, we use a pre-training model as a sample classifier to distinguish hard samples from easy samples in the training dataset. We obtained better performance from the model after adding the enhanced hard examples.

Figure 1. Examples of the imbalance between background voxels and target voxels in images. Each row shows a CT scan from an individual patient. The red regions denote the liver.

Figure 2. Examples of easy and hard predicted slices in CT scans. The predicted results are based on the FCN model with 10 × 10^5 iterations. (A,B) display the easy samples; (C,D) display the hard slices. Blue and red lines denote ground truth and prediction results. Each row shows results acquired from an individual case.

Inspired by the pre-training strategy, in this work a pre-training model is used as a sample classifier to classify hard samples and easy samples in the training dataset. First, the whole training dataset is divided into two parts (A and B) based on simple statistical information (e.g., the number of slices in a volume and the proportion of positive and negative samples in a volume). In this way, the ratio of positive to negative slices in the two subsets (A and B) can be kept the same as that of the whole training dataset. Part A is used for the later sample classification and screening, while part B is used for model pre-training. Second, slices in part B are enhanced by flipping and mirroring, and these enhanced slices are used in the model pre-training process. We obtain a pre-training model once the model has been trained for a set number of iterations (5 × 10^5 iterations in this work). Third, the pre-training model is used to predict slices in part A, and all slices in part A are simply divided into two categories, hard samples and easy samples, by their Dice score. Next, the hard slices in part A are enhanced by flipping and then added to the training dataset (part B) used in the pre-training process. Finally, we continue the training process until reaching the set 10 × 10^5 iterations, and then obtain the final segmentation model. Only 5 × 10^5 iterations are needed in the final training stage if 5 × 10^5 iterations were done in the pre-training process and the pre-training model structure is consistent with the final model, while 10 × 10^5 iterations are needed in the final training stage if the pre-training model structure is inconsistent with the final model. In this study, we use the same model structure in the pre-training process and the final training stage.
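The text does not specify exactly how the scans are divided into parts A and B, so the following is only one possible sketch. It assumes a single per-scan statistic (here the fraction of positive slices) and assigns scans to B in sorted order so that both parts cover a similar range of that statistic. The function names and the assignment rule are hypothetical. The second function restates the iteration budget rule from the paragraph above.

```python
import numpy as np

def split_scans_by_statistic(positive_fractions, selection_ratio=0.5):
    """Split scan indices into part A (screening) and part B (pre-training).

    Scans are visited in order of their positive-slice fraction, and a scan goes
    to B whenever B holds less than `selection_ratio` of the scans seen so far.
    This rule is an illustrative assumption, not the authors' procedure.
    """
    order = np.argsort(np.asarray(positive_fractions, dtype=float))
    part_a, part_b = [], []
    for rank, scan_index in enumerate(order, start=1):
        if len(part_b) < selection_ratio * rank:
            part_b.append(int(scan_index))
        else:
            part_a.append(int(scan_index))
    return part_a, part_b

def final_stage_iterations(total_iterations=1_000_000,
                           pretrain_iterations=500_000,
                           same_architecture=True):
    """Iterations left for the final stage: training resumes from the pre-trained
    weights only when the two model structures match."""
    if same_architecture:
        return total_iterations - pretrain_iterations
    return total_iterations

# Toy example: 220 training scans with random positive-slice fractions.
rng = np.random.default_rng(0)
fractions = rng.uniform(0.1, 0.4, size=220)
part_a, part_b = split_scans_by_statistic(fractions, selection_ratio=0.5)
```

With a selection ratio of 0.5, the 220 training scans are split evenly, and the mean positive-slice fraction of the two parts stays close.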

Figure 3. Results of liver segmentation using the FCN model. (A,B) display the 3D liver segmentation results of the baseline and the proposed model, respectively; (C,D) display the liver segmentation results on hard samples for the baseline and the proposed model, respectively. Blue and red lines in (C,D) denote ground truth and prediction results. Each row shows results acquired from an individual case.


Table 1. Liver segmentation results on the test dataset for different selection ratios using the FCN model.

Table A4. Kidney segmentation results on the test dataset for different selection ratios using the U-Net model.

Table A5. Spleen segmentation results on the test dataset for different selection ratios using the U-Net model.