Automatic Pancreas Segmentation Using Coarse-Scaled 2D Model of Deep Learning: Usefulness of Data Augmentation and Deep U-Net

Abstract: Combinations of data augmentation methods and deep learning architectures for automatic pancreas segmentation on CT images are proposed and evaluated. Images from a public CT dataset of pancreas segmentation were used to evaluate the models. Baseline U-net and deep U-net were chosen for the deep learning models of pancreas segmentation. Methods of data augmentation included conventional methods, mixup, and random image cropping and patching (RICAP). Ten combinations of the deep learning models and the data augmentation methods were evaluated. Four-fold cross validation was performed to train and evaluate these models with data augmentation methods. The dice similarity coefficient (DSC) was calculated between automatic segmentation results and manually annotated labels and these were visually assessed by two radiologists. The performance of the deep U-net was better than that of the baseline U-net with mean DSC of 0.703–0.789 and 0.686–0.748, respectively. In both baseline U-net and deep U-net, the methods with data augmentation performed better than methods with no data augmentation, and mixup and RICAP were more useful than the conventional method. The best mean DSC was obtained using a combination of deep U-net, mixup, and RICAP, and the two radiologists scored the results from this model as good or perfect in 76 and 74 of the 82 cases.


Introduction
Identification of anatomical structures is a fundamental step for radiologists in the interpretation of medical images. Similarly, automatic and accurate organ identification or segmentation is important for medical image analysis, computer-aided detection, and computer-aided diagnosis. To date, many studies have worked on automatic and accurate segmentation of organs, including lung, liver, pancreas, uterus, and muscle [1][2][3][4][5].
An estimated 606,880 Americans were predicted to die from cancer in 2019, of which 45,750 deaths would be due to pancreatic cancer [6]. Among all major types of cancers, the five-year relative survival rate of pancreatic cancer was the lowest (9%). One of the reasons for this low survival rate is the difficulty in the detection of pancreatic cancer in its early stages, because the organ is located in the retroperitoneal space and is in close proximity to other organs. A lack of symptoms is another reason for the difficulty of its early detection. Therefore, computer-aided detection and/or diagnosis using computed tomography (CT) may contribute to a reduction in the number of deaths caused by pancreatic cancer, similar to the effect of CT screenings on lung cancer [7,8]. Accurate segmentation of the pancreas is the first step in a computer-aided detection/diagnosis system for pancreatic cancer.
Conventional techniques of organ segmentation use hand-tuned filters and classifiers. In contrast, deep learning, such as convolutional neural networks (CNN), is a framework that lets computers learn and build these filters and classifiers from a huge amount of data. Recently, deep learning has been attracting much attention in medical image analysis, as it has been demonstrated to be a powerful tool for organ segmentation [9]. Pancreas segmentation using CT images is challenging because the pancreas does not have a distinct border with its surrounding structures. In addition, the shape and size of the pancreas vary widely among people. Therefore, several different approaches to pancreas segmentation using deep learning have been proposed [10][11][12][13][14][15].
Previous studies designed to improve the deep learning model of automatic pancreas segmentation [10][11][12][13][14][15] can be classified using three major aspects: (i) dimension of the convolutional network, two-dimensional model (2D) versus three-dimensional model (3D); (ii) use of coarse-scaled model versus fine-scaled model; (iii) improvement of network architecture. In (i), the accuracy of pancreas segmentation was improved in a 3D model compared with a 2D model; the 3D model makes it possible to fully utilize the 3D spatial information of the pancreas, which is useful for grasping the large variability in pancreas shape and size. In (ii), an initial coarse-scaled model was used to obtain a rough region of interest (ROI) of the pancreas, and then the ROI was used for segmentation refinement using a fine-scaled model of pancreas segmentation. The difference in mean dice similarity coefficient (DSC) between the coarse-scaled and fine-scaled models ranged from 2% to 7%. In (iii), the network architecture of a deep learning model was modified for efficient segmentation. For example, when an attention unit was introduced in a U-net, the segmentation accuracy was better than in a conventional U-net [12].
In previous studies, the usefulness of data augmentation in pancreas segmentation was not fully evaluated; only conventional methods of data augmentation were utilized. Recently proposed methods of data augmentation, such as mixup [16] and random image cropping and patching (RICAP) [17], were not evaluated.
In conventional data augmentation, horizontal flipping, vertical flipping, scaling, rotation, etc., are commonly used. It is necessary to find an effective combination of these, since among the possible combinations, some degrade the performance. Due to the number of the combinations, it is relatively cumbersome to eliminate the counterproductive combinations in conventional data augmentation. For this purpose, AutoAugment finds the best combination of data augmentation [18]. However, it is computationally expensive due to its use of reinforcement learning. In this regard, mixup and RICAP are easier to adjust than conventional data augmentation because they both have only one parameter.
The purpose of the current study is to evaluate and validate the combinations of different types of data augmentation and network architecture modification of U-net [19]. A deep U-net was used to evaluate the usefulness of network architecture modification of U-net.

Materials and Methods
The current study used anonymized data extracted from a public database. Therefore, institutional review board approval was waived.

Dataset
The public dataset (Pancreas-CT) used in the current study includes 82 sets of contrast-enhanced abdominal CT images, where the pancreas was manually annotated slice-by-slice [20,21]. This dataset is publicly available from The Cancer Imaging Archive [22]. The Pancreas-CT dataset is commonly used to benchmark the segmentation accuracy of the pancreas on CT images. The CT scans in the dataset were obtained from 53 male and 27 female subjects. The age of the subjects ranged from 18 to 76 years with a mean age of 46.8 ± 16.7 years. The CT images were acquired with Philips and Siemens multi-detector CT scanners (120 kVp tube voltage). Spatial resolution of the CT images is 512 × 512 pixels with varying pixel sizes, and slice thickness is between 1.5 and 2.5 mm. As a part of image preprocessing, the pixel values for all sets of CT images were clipped to [−100, 240] Hounsfield units, then rescaled to the range [0, 1]. This preprocessing is commonly used for the Pancreas-CT dataset [15].
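The intensity preprocessing described above can be illustrated with a short NumPy sketch. It assumes that the input array is already expressed in Hounsfield units and that the rescaling is a linear min-max mapping of the clipping window onto [0, 1]; the function name `preprocess_ct` is hypothetical and not taken from the study.

```python
import numpy as np

HU_MIN, HU_MAX = -100.0, 240.0  # clipping window stated in the text


def preprocess_ct(volume_hu: np.ndarray) -> np.ndarray:
    """Clip CT values to [HU_MIN, HU_MAX] and rescale linearly to [0, 1].

    Assumption: linear min-max rescaling of the clipping window.
    """
    clipped = np.clip(volume_hu.astype(np.float32), HU_MIN, HU_MAX)
    return (clipped - HU_MIN) / (HU_MAX - HU_MIN)


# Toy values: air, lower bound, soft tissue, upper bound, dense bone
example = np.array([-1000.0, -100.0, 70.0, 240.0, 1500.0])
scaled = preprocess_ct(example)
```

Values below the window map to 0 and values above it map to 1, so structures far from soft-tissue density no longer dominate the input range.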

Deep Learning Model
U-net was used as a baseline model of deep learning in the current study [19]. U-net has an encoding-decoding architecture. Downsampling and upsampling are performed in the encoding and decoding parts of U-net, respectively. The most important characteristic of U-net is the presence of shortcut connections between the encoding part and the decoding part at equal resolution. While the baseline U-net performs downsampling and upsampling 4 times [19], deep U-net performs downsampling and upsampling 6 times. In addition to the number of downsampling and upsampling steps, the number of feature maps in the convolution layer and the use of dropout were changed in the deep U-net; the number of feature maps in the first convolution layer was set to 40 and the dropout probability to 2%. In the baseline U-net, 64 feature maps and no dropout were used. In both the baseline U-net and the deep U-net, the number of feature maps in the convolution layer was doubled after each downsampling. Figure 1 presents the deep U-net model of the proposed method. Both the baseline U-net and deep U-net utilized batch normalization. Keras (https://keras.io/) with a TensorFlow (https://www.tensorflow.org/) backend was used for the implementation of the U-net models. Image dimension of the input and output in the two U-net models was 512 × 512 pixels.
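To make the difference between the two architectures concrete, the following sketch tabulates the spatial size and number of feature maps at each resolution level of the encoder. It assumes that each downsampling halves the spatial size (as with the 2 × 2 pooling of the original U-net), which is not stated explicitly in the text; the function name `unet_encoder_plan` is hypothetical.

```python
def unet_encoder_plan(input_size, first_maps, n_down):
    """Return (spatial size, feature maps) for each encoder level.

    Assumption: every downsampling halves the spatial size.
    The doubling of feature maps after each downsampling follows the text.
    """
    plan = []
    size, maps = input_size, first_maps
    for _ in range(n_down + 1):
        plan.append((size, maps))
        size //= 2
        maps *= 2
    return plan


baseline = unet_encoder_plan(512, first_maps=64, n_down=4)
deep = unet_encoder_plan(512, first_maps=40, n_down=6)

for name, plan in (("baseline U-net", baseline), ("deep U-net", deep)):
    print(name, plan)
```

Under these assumptions the deep U-net reaches a much coarser bottleneck than the baseline, which gives each bottleneck feature a larger receptive field on the 512 × 512 input.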

Data Augmentation
To prevent overfitting in the training of the deep learning model, we utilized the following three types of data augmentation methods: conventional method, mixup [16], and RICAP [17]. Although mixup and RICAP were initially proposed for image classification tasks, we utilized them for segmentation by merging or cropping/patching labels in the same way as is done for images.
Conventional augmentation methods included ±5° rotation, ±5% x-axis shift, ±5% y-axis shift, and 95%-105% scaling. Both image and label were changed by the same transformation when using a conventional augmentation method.
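A minimal sketch of the conventional augmentation is given below. It is not the implementation used in the study: it assumes square 2D inputs, uniform sampling within the stated ranges, nearest-neighbour resampling for both image and label, and zero filling outside the field of view. All function names are hypothetical. The essential point it illustrates is that one set of parameters is drawn and applied to both the image and its label.

```python
import numpy as np


def random_affine_params(rng, size):
    """Draw one transformation within the ranges stated in the text."""
    return {
        "angle_deg": rng.uniform(-5.0, 5.0),
        "shift_x": rng.uniform(-0.05, 0.05) * size,
        "shift_y": rng.uniform(-0.05, 0.05) * size,
        "scale": rng.uniform(0.95, 1.05),
    }


def apply_affine(arr, p):
    """Rotate, scale, and shift a square 2D array about its centre.

    Uses inverse mapping with nearest-neighbour lookup; pixels that map
    outside the source array are set to 0.
    """
    n = arr.shape[0]
    c = (n - 1) / 2.0
    theta = np.deg2rad(p["angle_deg"])
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    rows, cols = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    # For each output pixel, undo the shift, then the rotation and scaling.
    yy = rows - c - p["shift_y"]
    xx = cols - c - p["shift_x"]
    src_x = (cos_t * xx + sin_t * yy) / p["scale"] + c
    src_y = (-sin_t * xx + cos_t * yy) / p["scale"] + c
    src_r = np.rint(src_y).astype(int)
    src_c = np.rint(src_x).astype(int)
    inside = (src_r >= 0) & (src_r < n) & (src_c >= 0) & (src_c < n)
    out = np.zeros_like(arr)
    out[inside] = arr[src_r[inside], src_c[inside]]
    return out


def augment_pair(image, label, rng):
    """Apply the same random transformation to an image and its label."""
    p = random_affine_params(rng, image.shape[0])
    return apply_affine(image, p), apply_affine(label, p)
```

Nearest-neighbour resampling keeps the label binary; a practical pipeline would more likely interpolate the image bilinearly while keeping nearest-neighbour lookup for the label.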
Mixup generates a new training sample from a linear combination of existing images and their labels [16]. Here, two sets of training samples are denoted by (x, y) and (x', y'), where x and x' are images, and y and y' are their labels. A generated sample (x#, y#) is given by:
$$x^{\#} = \lambda x + (1 - \lambda) x', \qquad y^{\#} = \lambda y + (1 - \lambda) y'$$
where λ ranges from 0 to 1 and is distributed according to the beta distribution: $\lambda \sim \mathrm{Beta}(\beta, \beta)$ for $\beta \in (0, \infty)$. The two samples to be combined are selected randomly from the training data. The hyperparameter β of mixup was set to 0.2 empirically.
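As an illustration of how mixup carries over to segmentation, the sketch below blends two image and label pairs with a single λ drawn from Beta(β, β). The function name `mixup_pair` is hypothetical, and the code is a sketch of the technique as described in the text rather than the implementation used in the study.

```python
import numpy as np


def mixup_pair(x1, y1, x2, y2, beta, rng):
    """Blend two (image, label) pairs with one lambda ~ Beta(beta, beta)."""
    lam = rng.beta(beta, beta)
    x_mix = lam * x1 + (1.0 - lam) * x2
    y_mix = lam * y1 + (1.0 - lam) * y2  # labels become soft, in [0, 1]
    return x_mix, y_mix, lam


rng = np.random.default_rng(0)
x_a, y_a = np.zeros((4, 4)), np.zeros((4, 4))
x_b, y_b = np.ones((4, 4)), np.ones((4, 4))
x_mix, y_mix, lam = mixup_pair(x_a, y_a, x_b, y_b, beta=0.2, rng=rng)
```

With a small β such as 0.2, the beta distribution concentrates near 0 and 1, so most generated samples stay close to one of the two originals. The mixed labels are no longer binary, which requires a loss that accepts soft targets.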
RICAP generates a new training sample from four randomly selected images [17]. The four images are randomly cropped and patched according to a boundary position $(w, h)$, which is determined according to the beta distribution: $w \sim \mathrm{Beta}(\beta, \beta)$ and $h \sim \mathrm{Beta}(\beta, \beta)$. We set the hyperparameter β of RICAP to 0.4 empirically. For the four images to be combined, the coordinates $(x_k, y_k)$ ($k$ = 1, 2, 3, and 4) of the upper left corners of the cropped areas are randomly selected. The sizes of the four cropped images are determined based on the value $(w, h)$, such that they do not increase the original image size. A generated sample is obtained by combining the four cropped images. In the current study, the image and its label were cropped at the same coordinate and size.
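The cropping and patching step can be sketched as follows. The sketch assumes that the beta-distributed values are scaled by the image width and height and rounded to obtain the boundary position in pixels, and that crop corners are drawn uniformly among valid positions; these details and the function name `ricap` are assumptions made for illustration. Each image and its label are cropped with the same corner and size, as stated in the text.

```python
import numpy as np


def ricap(images, labels, beta, rng):
    """Patch four (image, label) pairs into one sample.

    images, labels: arrays of shape (4, H, W).
    """
    _, height, width = images.shape
    # Boundary position in pixels (assumption: Beta sample scaled by size).
    w = int(np.round(width * rng.beta(beta, beta)))
    h = int(np.round(height * rng.beta(beta, beta)))
    patch_w = [w, width - w, w, width - w]
    patch_h = [h, h, height - h, height - h]
    dest = [(0, 0), (0, w), (h, 0), (h, w)]  # top-left corner of each patch
    out_img = np.zeros((height, width), dtype=images.dtype)
    out_lab = np.zeros((height, width), dtype=labels.dtype)
    for k in range(4):
        hk, wk = patch_h[k], patch_w[k]
        # Upper-left corner of the cropped area in source image k.
        y0 = rng.integers(0, height - hk + 1)
        x0 = rng.integers(0, width - wk + 1)
        r, c = dest[k]
        out_img[r:r + hk, c:c + wk] = images[k, y0:y0 + hk, x0:x0 + wk]
        out_lab[r:r + hk, c:c + wk] = labels[k, y0:y0 + hk, x0:x0 + wk]
    return out_img, out_lab
```

The four patches tile the output exactly, so the generated sample has the same size as the original images.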

Training
Dice loss function was used as the optimization target of the deep learning models. RMSprop was used as the optimizer, and its learning rate was set to 0.00004. The number of training epochs was set to 45. Following previous works on pancreas segmentation, we used 4-fold cross-validation to assess the robustness of the model (20 or 21 subjects were chosen for validation in each fold). The hyperparameters related to U-net and its training were selected using random search [23]. After the random search, the hyperparameters were fixed. The following 10 combinations of deep learning models and data augmentation methods were used:
1. Baseline U-net + no data augmentation
2. Baseline U-net + conventional method
3. Baseline U-net + mixup
4. Baseline U-net + RICAP
5. Baseline U-net + RICAP + mixup
6. Deep U-net + no data augmentation
7. Deep U-net + conventional method
8. Deep U-net + mixup
9. Deep U-net + RICAP
10. Deep U-net + RICAP + mixup
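Two elements of the training setup, the Dice loss and the 4-fold split of the 82 cases, can be sketched in NumPy as follows. The smoothing constant `eps`, the random shuffling with a fixed seed, and both function names are assumptions made for illustration; the study's Keras implementation is not described at this level of detail.

```python
import numpy as np


def dice_loss(y_true, y_pred, eps=1e-6):
    """Soft Dice loss: 1 - 2|A*B| / (|A| + |B|), computed on probabilities.

    eps is a small smoothing constant (assumption) that avoids 0/0.
    """
    intersection = np.sum(y_true * y_pred)
    denominator = np.sum(y_true) + np.sum(y_pred)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def four_fold_split(n_cases=82, n_folds=4, seed=0):
    """Shuffle case indices and split them into validation folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_cases)
    return [np.sort(fold) for fold in np.array_split(order, n_folds)]


folds = four_fold_split()
fold_sizes = [len(f) for f in folds]
```

Splitting 82 cases into four folds yields validation sets of 20 or 21 cases, in line with the fold sizes given in the text. Because the loss is computed on probabilities, it also accepts the soft labels produced by mixup.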

Evaluation of Pancreas Segmentation
For each validation case of the Pancreas-CT dataset, three-dimensional CT images were processed slice-by-slice using the trained deep learning models, and the segmentation results were stacked. Except for the stacking, no complex postprocessing was utilized. Quantitative and qualitative evaluations were performed for the automatic segmentation results.
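The evaluation procedure of this section (slice-by-slice prediction, stacking, binarization at a threshold of 0.5, and the four voxel-overlap metrics DSC, JI, SE, and SP) can be sketched as follows. The callable `predict_slice` stands in for a trained 2D model and is hypothetical, as are the function names; the sketch also assumes that both the predicted mask and the label are non-empty and that "threshold of 0.5" means strictly greater than 0.5.

```python
import numpy as np


def segment_volume(volume, predict_slice, threshold=0.5):
    """Predict each axial slice, stack the outputs, and binarize them."""
    probs = np.stack([predict_slice(s) for s in volume], axis=0)
    return probs > threshold


def overlap_metrics(pred, label):
    """Compute DSC, JI, SE, and SP from two binary 3D masks."""
    pred = pred.astype(bool)
    label = label.astype(bool)
    tp = np.count_nonzero(pred & label)  # |P ∩ L|
    n_pred = np.count_nonzero(pred)      # |P|
    n_label = np.count_nonzero(label)    # |L|
    n_total = pred.size                  # |I|
    return {
        "DSC": 2.0 * tp / (n_pred + n_label),
        "JI": tp / (n_pred + n_label - tp),
        "SE": tp / n_label,
        "SP": 1.0 - (n_pred - tp) / (n_total - n_label),
    }


# Toy volume of 8 voxels: 4 labelled, 3 predicted, 2 in common
label = np.array([1, 1, 1, 1, 0, 0, 0, 0]).reshape(2, 2, 2)
pred = np.array([1, 1, 0, 0, 1, 0, 0, 0]).reshape(2, 2, 2)
metrics = overlap_metrics(pred, label)
```

Because the pancreas occupies only a small fraction of an abdominal CT volume, SP is dominated by the large number of background voxels, whereas DSC, JI, and SE depend only on the overlap with the pancreas.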
The metrics of quantitative evaluation were calculated using the three-dimensional segmentation results and annotated labels. Four types of metrics were used for the quantitative evaluation of the segmentation results: dice similarity coefficient (DSC), Jaccard index (JI), sensitivity (SE), and specificity (SP). These metrics are defined by the following equations:
$$\mathrm{DSC} = \frac{2\,|P \cap L|}{|P| + |L|}, \qquad \mathrm{JI} = \frac{|P \cap L|}{|P| + |L| - |P \cap L|}, \qquad \mathrm{SE} = \frac{|P \cap L|}{|L|}, \qquad \mathrm{SP} = 1 - \frac{|P| - |P \cap L|}{|I| - |L|}$$
where |P|, |L|, and |I| denote the number of voxels in the pancreas segmentation result, the annotated label of the pancreas, and the three-dimensional CT images, respectively. |P ∩ L| represents the number of voxels that the deep learning models correctly segmented as pancreas (true positives). Before calculating the four metrics, a threshold of 0.5 was applied to the output of the U-net to obtain the pancreas segmentation mask [24]. The threshold of 0.5 was fixed for all 82 cases. A Wilcoxon signed rank test was used to test the statistical significance of differences among the DSC results of the 10 combinations of deep learning models and data augmentation methods. Bonferroni correction was used to control the family-wise error rate; p-values less than 0.05/45 = 0.00111 were considered statistically significant.
For the qualitative evaluation, two radiologists with 14 and 6 years of experience visually evaluated both the manually annotated labels and automatic segmentation results using a 5-point scale: 1, unacceptable; 2, slightly unacceptable; 3, acceptable; 4, good; 5, perfect. Inter-observer variability between the two radiologists was evaluated using weighted kappa with squared weights.

Results
Table 1 shows the results of the qualitative evaluation of the pancreas segmentation of Deep U-net + RICAP + mixup and the manually annotated labels. The mean visual scores of manually annotated labels were 4.951 and 4.902 for the two radiologists, and those of automatic segmentation results were 4.439 and 4.268. The mean score of automatic segmentation results demonstrates that the accuracy of the automatic segmentation was good; more than 92.6% (76/82) and 87.8% (74/82) of the cases were scored as 4 or above. Notably, Table 1 shows that the manually annotated labels were scored as 4 (good, but not perfect) in four and eight cases by the two radiologists. Weighted kappa values between the two radiologists were 0.465 (moderate agreement) for the manually annotated labels and 0.723 (substantial agreement) for the automatic segmentation results.
Table 2 shows the results of the quantitative evaluation of pancreas segmentation. Mean and standard deviation of DSC, JI, SE, and SP were calculated from the validation cases of 4-fold cross validation for the Pancreas-CT dataset. Mean DSC of the deep U-net (0.703-0.789) was better than the mean DSC of the baseline U-net (0.686-0.748) across all data augmentation methods. Because mean SP was 1.00 in all the combinations, non-pancreas regions were not segmented by the models. Therefore, mean DSC was mainly affected by mean SE (segmentation accuracy within the pancreas region only) as shown in Table 2. Table 2 also shows the usefulness of data augmentation. In both the baseline U-net and deep U-net, the model combined with any of the three types of data augmentation performed better than the model with no data augmentation. In addition, mixup and RICAP were more useful than the conventional method; the best mean DSC was obtained using the combination of mixup and RICAP with the deep U-net.
Note for Table 2: data are shown as mean ± standard deviation. Abbreviations: random image cropping and patching (RICAP), dice similarity coefficient (DSC), Jaccard index (JI), sensitivity (SE), and specificity (SP).
Table B1 of Appendix B shows the results of the Wilcoxon signed rank test. After the Bonferroni correction, the DSC differences between Deep U-net + RICAP + mixup and six of the other models were statistically significant.
Representative images of pancreas segmentation are shown in Figures 2 and 3. In the case of Figure 2, the manually annotated label was scored as 4 by the two radiologists because the main pancreatic duct and its surrounding tissue were excluded from the label.
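Two statistical quantities used in this study, the Bonferroni-corrected significance threshold and the weighted kappa with squared weights, can be reproduced with the short sketch below. It assumes that the divisor 45 corresponds to all pairwise comparisons among the 10 combinations, and it uses the standard definition of quadratic-weighted kappa for ratings on a 1 to 5 scale; the function name is hypothetical and the Wilcoxon signed rank test itself is not included.

```python
from math import comb

import numpy as np

# Number of pairwise comparisons among 10 combinations (assumption: all pairs)
n_comparisons = comb(10, 2)
alpha_corrected = 0.05 / n_comparisons


def quadratic_weighted_kappa(rater1, rater2, n_levels=5):
    """Weighted kappa with squared weights for integer scores 1..n_levels."""
    r1 = np.asarray(rater1) - 1
    r2 = np.asarray(rater2) - 1
    observed = np.zeros((n_levels, n_levels))
    for a, b in zip(r1, r2):
        observed[a, b] += 1
    observed /= observed.sum()
    # Expected agreement table from the two raters' marginal distributions
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    i, j = np.indices((n_levels, n_levels))
    weights = (i - j) ** 2 / (n_levels - 1) ** 2
    return 1.0 - np.sum(weights * observed) / np.sum(weights * expected)
```

The function returns 1 for identical ratings and values near 0 for chance-level agreement. It is undefined when both raters give a single identical score to every case, because the expected disagreement is then zero.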

Discussion
The results of the present study show that the three types of data augmentation were useful for the pancreas segmentation in both the baseline U-net and deep U-net. In addition, the deep U-net, which is characterized by additional layers, was overall more effective for automatic pancreas segmentation than the baseline U-net. In data augmentation, not only the conventional method, but also mixup and RICAP were useful for pancreas segmentation; the combination of mixup and RICAP was the most useful.
Table 3 summarizes results of previous studies using the Pancreas-CT dataset. While Table 3 includes the studies with coarse-scaled models, Table A1 includes the studies with fine-scaled models. As shown in Table 3, the coarse-scaled 2D model of the current study achieved sufficiently high accuracy, comparable to those of previous studies. While the present study focused on the 2D coarse-scaled models, the data augmentation methods used in the present study can be easily applied to 3D fine-scaled models. Therefore, it can be expected that the combination of the proposed data augmentation methods and 3D fine-scaled models might lead to further improvement of automatic pancreas segmentation.
Data augmentation was originally proposed for the classification model, and the effectiveness of mixup was validated for segmentation on brain MRI images [25]. The results of the current study demonstrate the effectiveness of multiple types of data augmentation methods for the two models of U-net for automatic pancreatic segmentation. To the best of our knowledge, the current study is the first to validate the usefulness of multiple types of data augmentation methods in pancreas segmentation.
Table 2 shows that deep U-net was better than baseline U-net. Deep U-net included additional layers in its network architecture, compared with baseline U-net. It is speculated that these additional layers could lead to performance improvement for pancreas segmentation. Nakai et al. [26] showed that deeper U-net could efficiently denoise low-dose CT images. They also showed that deeper U-net was better than baseline U-net. Kurata et al. [4] showed that their U-net with additional layers was effective for uterine segmentation. The results of the current study are consistent with the results of these studies. The effectiveness of deep/deeper U-net has not been sufficiently investigated so far. Because U-net can be used for segmentation, image denoising, detection, and modality conversion, it is necessary to evaluate what tasks the deep/deeper U-net is effective for.
Combined use of mixup and RICAP was the best for data augmentation in the current study. The combination of mixup and RICAP was also used in the study of bone segmentation [24]. The results of bone segmentation show that effectiveness of data augmentation was observed in the dataset with limited cases, and the optimal combination was conventional method and RICAP. Based on the studies of bone and pancreas segmentation, usefulness of combination of conventional method, mixup, and RICAP should be further investigated. Sandfort et al. used CycleGAN as data augmentation to improve generalizability in organ segmentation on CT images [27]. CycleGAN was also used for data augmentation in the classification task [28]. Because the computational cost of training CycleGAN is relatively high, the use of CycleGAN as a data augmentation method needs some consideration. In this regard, computational cost of mixup and RICAP is relatively low, and mixup and RICAP are easy to implement.
Accuracy of pancreas segmentation was visually evaluated by the two radiologists in the current study. To our knowledge, no previous deep learning study has visually evaluated the accuracy of pancreas segmentation. The visual scores indicate that the automatic segmentation model of the current study performed well. It is expected that the proposed model may be useful for clinical cases if the clinical CT images have similar condition and quality to those of the Pancreas-CT dataset.
In the current study, we evaluated automatic pancreas segmentation using the public dataset called Pancreas-CT. Although this dataset was used in several studies as shown in Table 3, the manually annotated labels of four and eight cases were scored as not perfect by the two radiologists in the visual assessment of the current study. In most of these cases, the labels for the pancreas head were assessed as low-quality. It is presumed that the low-quality labeling is caused by the fact that annotators did not fully understand the boundary between the pancreas and other organs (e.g., duodenum). To evaluate the segmentation accuracy, reliable labeling is mandatory. For this purpose, a new database for pancreas segmentation is desirable.
There were several limitations to the present study. First, we investigated the usefulness of data augmentation only in segmentation models. The usefulness of data augmentation should be evaluated for other models such as classification, detection, and image generation. Second, the 3D fine-scaled model of pancreas segmentation was not evaluated. Because U-net, mixup, and RICAP were originally suggested for 2D models, we constructed and evaluated the 2D model of pancreas segmentation. We will apply the proposed methods to the 3D fine-scaled model in future research.

Conclusions
The combination of deep U-net with mixup and RICAP achieved automatic pancreas segmentation, which the radiologists scored as good or perfect. We will further investigate the usefulness of the proposed method for the 3D coarse-scaled/fine-scaled models to improve segmentation accuracy.

Funding: The present study was supported by JSPS KAKENHI, grant number JP19K17232.

Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Table A1. Summary of fine-scaled models using Pancreas-CT dataset.