Improvement in the Convolutional Neural Network for Computed Tomography Images

Abstract: Background and purpose. This study evaluated a modified convolutional neural network (CNN) specialized for medical images to improve classification accuracy. Materials and Methods. We defined computed tomography (CT) images as belonging to one of the following 10 classes: head, neck, chest, abdomen, and pelvis, each with and without contrast media, with 10,000 images per class. We modified a CNN based on AlexNet with an input size of 512 × 512 by adjusting the filter sizes of the convolution layers and max pooling. Using these modified CNNs, various models were created and evaluated. The improved CNN was then evaluated on classifying the presence or absence of the pancreas in CT images. We compared the overall accuracy, calculated from images not used for training, to that of ResNet. Results. The overall accuracies of the most improved CNN and ResNet in the 10 classes were 94.8% and 89.3%, respectively. The filter sizes of the improved CNN for the convolution layers were (13, 13), (7, 7), (5, 5), (5, 5), and (5, 5) in order from the first layer, and that of max pooling was (7, 7). The calculation times of the most improved CNN and ResNet were 56 and 120 min, respectively. Regarding the classification of the pancreas, the overall accuracies of the most improved CNN and ResNet were 75.75% and 58.25%, respectively, with calculation times of 36 and 55 min. Conclusion. By optimizing the filter sizes of the convolution layers and max pooling for 512 × 512 images, we quickly obtained a highly accurate medical image classification model. This improved CNN can be useful for classifying lesions and anatomies in related diagnostic aid applications.


Introduction
Image classification is a typical image analysis technology that uses artificial intelligence. In computed tomography (CT) images, it has been used to classify pulmonary nodules [1,2], slice positions [3,4], and calcaneal fractures [5,6]. AlexNet [7] and ResNet [8] are examples of image classification models. Because AlexNet has fewer layers than ResNet, its accuracy is lower but its calculation time is shorter. Training on images with a large pixel size incurs a high calculation cost and can exhaust graphics processing unit (GPU) memory. Therefore, images resized to 224 × 224, the default pixel size of most image classification models, are often used for training [9,10]. In the study by Santin et al. [9], data augmentation and transfer learning were used to improve the accuracy and robustness of the model, but the input images were fixed at 224 × 224 × 3 pixels; the pixel size itself was not examined. In medical imaging, however, the typical pixel size is 512 × 512. Reducing the image size compresses the image information and may reduce the number of extractable features. Because several tens of thousands of images are required for medical image classification, training takes a long time [11]. If AlexNet, whose few layers make it fast despite its relatively low accuracy, can be specialized for 512 × 512 images, two problems may be solved: the loss of features due to resizing and the long calculation time that burdens models with many parameters. Being able to train with original-size images is also useful because micro lesions need to be detected in actual diagnosis. In this study, the parameters of AlexNet were customized for medical images, and the accuracy and calculation time of the resulting convolutional neural network (CNN) were evaluated.
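To put the resizing concern in concrete terms: downsampling a 512 × 512 CT slice to 224 × 224 retains less than a fifth of the original pixels. A minimal sketch of that arithmetic (the two sizes come from the text; the rest is illustrative):

```python
# Pixel budget of a CT slice before and after the common 224 x 224 resize.
original = 512 * 512  # native CT matrix: 262,144 pixels
resized = 224 * 224   # default input of many classification models: 50,176 pixels

retained = resized / original
print(f"pixels retained after resizing: {retained:.1%}")  # ~19.1%
```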
The generalization capability of the improved CNN was evaluated by classifying the presence or absence of the pancreas, which is considered difficult [12].

Subjects and Datasets
In this study, we targeted 118,000 512 × 512 axial CT images. These images were approved by the Hokkaido University Hospital Ethics Committee. In the 10 classes for adjusting AlexNet parameters, 100,000 images were used for training, and 10,000 images were used for accuracy verification. In the classification of the presence or absence of the pancreas to evaluate the generalization capability of the improved CNN, 6000 images were used for training, and 2000 images were used for accuracy verification.

Ten Classes for Adjusting AlexNet Parameters
We defined training and accuracy verification images as the following 10 classes: head, neck, chest, abdomen, and pelvis, each with and without contrast media. Training images comprised 10,000 images per class, and accuracy verification images included 1000 per class. The original AlexNet was trained with 224 × 224 and 512 × 512 images, and the overall accuracy and calculation time of these two models were compared. The respective models were named original (input image size: 224 × 224) and original (input image size: 512 × 512). The AlexNet parameters adjusted for 512 × 512 images were the filter sizes of the convolution layer and max pooling. Because the filter sizes of the convolution layer had many change patterns, we divided them into five groups (A, B, C, D, and E) and named each model A1, A2, and so on. Figure 1 presents the change patterns of groups A to E, and Figure 2 shows the original and changed values of the filter sizes of the convolution layer. The original filter size of max pooling was (3, 3), which we changed to odd values from 5 to 15. We calculated the overall accuracies of these models using the accuracy verification images. Various models were then created from all combinations of parameters that exceeded the overall accuracy of original (input image size: 512 × 512), and their overall accuracies were calculated. These models were named "group name of convolution layer"-"filter size of max pooling," such as A1-5. Next, the original ResNet was trained with 224 × 224 images, and its overall accuracy and calculation time were calculated. We compared the overall accuracy, confusion matrix, and calculation time of the model with the highest overall accuracy among the created models to those of ResNet.
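The layer geometry of the best configuration reported later (A1-7) can be sanity-checked with a small output-size calculation. The kernel sizes below are those examined in this study; the strides and paddings are assumptions carried over from the standard AlexNet (stride 4 in the first convolution, stride 2 pooling, size-preserving padding on the later convolutions), since only the filter sizes were changed:

```python
def conv_out(size: int, kernel: int, stride: int, pad: int) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

size = 512  # native CT input, per the study

# (kernel, stride, pad, label): kernels from A1-7; strides/pads assumed AlexNet-like
layers = [
    (13, 4, 2, "conv1"), (7, 2, 0, "pool1"),
    (7, 1, 3, "conv2"),  (7, 2, 0, "pool2"),
    (5, 1, 2, "conv3"),
    (5, 1, 2, "conv4"),
    (5, 1, 2, "conv5"),  (7, 2, 0, "pool3"),
]
for k, s, p, name in layers:
    size = conv_out(size, k, s, p)
    print(f"{name}: {size} x {size}")
# Under these assumptions the stack ends at an 11 x 11 feature map.
```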

Classification of the Presence or Absence of the Pancreas to Evaluate the Generalization Capability of the Improved CNN
We defined training and accuracy verification images as the following four classes: the presence or absence of the pancreas, with and without contrast media. Training images comprised 1500 images per class, and accuracy verification images included 500 per class. ResNet and the model with the highest overall accuracy among the created models were trained, and we compared the overall accuracy, confusion matrix, and calculation time. Figure 3 shows the entire learning and evaluation process. For the training, we used a PC with an NVIDIA GeForce GTX TITAN X 12 GB (NVIDIA Corporation, Santa Clara, CA, USA).

Results

Ten Classes for Adjusting AlexNet Parameters
Table 1 shows the overall accuracies and calculation times of the original AlexNet trained with 224 × 224 and 512 × 512 images. In the comparison between original (input image size: 224 × 224) and original (input image size: 512 × 512), original (input image size: 512 × 512) had higher overall accuracy and a shorter calculation time. Figure 4 presents the overall accuracies and calculation times of the models in which the filter sizes of the convolution layer were changed, and Figure 5 shows those of the models in which the filter sizes of max pooling were changed. The models exceeding the overall accuracy of original (input image size: 512 × 512) were A1~3, B1~3, C1~2, D1~2, and E1, along with the models with max-pooling filter sizes of 5, 7, and 9. Figures 6-8 show the overall accuracies and calculation times of the models with all combinations of these parameters. Among them, the highest overall accuracy was achieved by model A1-7. The filter sizes of the convolution layer for A1-7 were (13, 13), (7, 7), (5, 5), (5, 5), and (5, 5) in order from the first layer, and that of max pooling was (7, 7). The overall accuracy of A1-7 was 94.40%, and the calculation time was 56 min. Figure 9 displays the confusion matrix. In contrast, the overall accuracy of ResNet was 88.80%, and the calculation time was 120 min; the confusion matrix is shown in Figure 10. In the comparison of A1-7 and ResNet, A1-7 was superior in both overall accuracy and calculation time.
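The overall accuracy figures reported here are derived from confusion matrices (Figures 9 and 10). As a reference for how such a number is computed, here is a generic helper with a made-up 3-class matrix; the values are illustrative, not those of the figures:

```python
def overall_accuracy(confusion: list[list[int]]) -> float:
    """Overall accuracy = correctly classified samples (diagonal) / all samples."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class.
cm = [
    [950, 30, 20],
    [40, 930, 30],
    [25, 25, 950],
]
print(f"overall accuracy: {overall_accuracy(cm):.2%}")  # 2830 / 3000 = 94.33%
```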

Classification of the Presence or Absence of the Pancreas for the Evaluation of the Generalization Capability of the Improved CNN
The overall accuracy of A1-7 was 75.75%, and the calculation time was 36 min. Figure 11 shows the confusion matrix. The overall accuracy of ResNet was 58.25%, and the calculation time was 55 min. Figure 12 displays the confusion matrix. In the comparison of A1-7 and ResNet, A1-7 was superior to ResNet in both overall accuracy and calculation time, as in the 10 classes.

Discussion
In the comparison between original (input image size: 224 × 224) and original (input image size: 512 × 512), original (input image size: 512 × 512) was found to have higher overall accuracy and a shorter calculation time. This means that training with large pixel-sized images may improve the overall accuracy and that increasing the pixel size does not always increase the calculation time.
As a result of changing the filter size of the convolution layer, the overall accuracy was improved when the filter size was slightly increased, and it was decreased when the filter size was further increased in groups A to E. This result suggests that when training with images whose pixel size is larger than the original CNN, the overall accuracy is improved by appropriately increasing the filter size of the convolution layer. In the 512 × 512 image, the original filter size has a narrow range for the extraction of features, which makes it difficult to extract the overall features. Therefore, increasing the filter size improved the overall accuracy. However, if the filter sizes are made too large, detailed features cannot be extracted, and the overall accuracy is decreased. On the other hand, the calculation time became longer as the filter size increased. This is because the filter size was increased without changing the stride. The number of feature extractions was the same, but the range increased; thus, the calculation time became longer.
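The longer calculation time can be made concrete: with the stride fixed, the multiply-accumulate cost of a convolution layer grows roughly with the square of the kernel size. A rough first-layer comparison for a 512 × 512 input (the channel counts, stride, and padding are assumed AlexNet-like, since the study changed only the filter sizes):

```python
def conv_macs(in_size, kernel, stride, pad, c_in, c_out):
    """Rough multiply-accumulate count for one convolution layer."""
    out = (in_size + 2 * pad - kernel) // stride + 1
    return out * out * kernel * kernel * c_in * c_out

small = conv_macs(512, 11, 4, 2, 3, 96)  # original AlexNet first-layer kernel
large = conv_macs(512, 13, 4, 2, 3, 96)  # enlarged kernel examined in this study
print(f"cost ratio (13x13 vs 11x11): {large / small:.2f}")  # ~1.37x more work
```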
When the filter sizes of max pooling were changed to odd values from 5 to 15, the models with filter sizes of 5, 7, and 9 exceeded original (input image size: 512 × 512), and the models with filter sizes of 11 or larger fell below it. This is because the overall features can be extracted by increasing the filter sizes of max pooling to fit the 512 × 512 images; however, if the filter sizes are made too large, detailed features can no longer be extracted, and the overall accuracy is reduced. On the other hand, the calculation time was almost constant. Max pooling is an operation that extracts the maximum value, and its calculation load is small; therefore, the calculation time remained constant even as the filter size increased. When combining the filter sizes of the convolution layer and max pooling that had high overall accuracy, the overall accuracy of A1-7 was the highest. Notably, A1 and the model with a max-pooling filter size of 7 were not the most accurate models when changed separately. Combining the parameters of the individually best models does not guarantee a model with higher overall accuracy; rather, the overall accuracy varies depending on the compatibility of the parameter combinations.
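The near-constant pooling time is easy to rationalize: max pooling performs only comparisons and has no weights, so even a (7, 7) window is cheap next to a convolution layer. A back-of-the-envelope comparison (feature-map size, channel count, stride, and padding are AlexNet-like assumptions, not values stated in the text):

```python
def out_size(n, k, s, p=0):
    """Spatial output size for a sliding window of size k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1

# Comparisons done by a (7, 7) max-pooling pass over an assumed first feature map.
pool_in, channels = 126, 96  # assumed conv1 output for a 512 x 512 input
pool_out = out_size(pool_in, 7, 2)
pool_ops = pool_out ** 2 * channels * (7 * 7 - 1)  # 48 comparisons per window

# Multiply-accumulates of the first convolution layer (13 x 13 kernel, stride 4).
conv_out_sz = out_size(512, 13, 4, 2)
conv_ops = conv_out_sz ** 2 * 13 * 13 * 3 * 96

print(f"pooling ops are ~{pool_ops / conv_ops:.1%} of first-layer conv MACs")
```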
In the comparison of A1-7 and ResNet, A1-7 was superior in both overall accuracy and calculation time. According to the confusion matrix, ResNet often misclassified heads without contrast media as heads with contrast media. In contrast, the accuracy of A1-7 was 96%, which is about three times higher. Older CNNs, such as AlexNet, can exceed the overall accuracy of the relatively new ResNet by specializing the parameters to 512 × 512 images. Therefore, by specializing the parameters to the uncompressed pixel size, the overall accuracy of CNNs trained with compressed images might be improved.
Regarding the pancreas classification, A1-7 was superior to ResNet in both overall accuracy and calculation time, as in the 10 classes. According to the confusion matrix, the classification of images with the pancreas had about the same accuracy, but the accuracy of A1-7 in the classification of images without the pancreas was about twice that of ResNet. For this reason, A1-7 is not a CNN specialized for only 10 classes but is rather a generalized CNN.
In the study by Lakhani et al. [13], the authors used AlexNet and GoogLeNet [14] and created four types of image classification models with and without transfer learning, and the accuracy of the classification of tuberculosis was compared. Although the creation of multiple models and the accuracy comparison were similar to those in this study, Lakhani et al. did not change the CNN parameters, such as the filter size of the convolution layer. AlexNet, GoogLeNet, and transfer learning are technologies developed for general images and are not specialized for medical imaging. Therefore, in this study, we improved the accuracy by specializing the filter sizes of the convolution layer and max-pooling for medical images.
This study has five limitations. First, the comparison target was only ResNet. The created CNN was compared with ResNet, a typical example of a high-performance CNN; however, the latest CNNs are more accurate [15][16][17][18][19][20], so the created CNN should also be compared with them. Second, we did not evaluate some parameters. However, considering that there is a limit to the number of parameters that can be evaluated individually and that we obtained an overall accuracy of 94.40%, we believe that the number of parameters evaluated was sufficient. Third, we used holdout validation. Ideally, k-fold cross-validation should be used, but we chose holdout validation because there were too many models to validate. In general, the ratio of data sets for holdout validation is 80:20 [5,21], and the mean performance is obtained from multiple data sets. In this study, the ratio of data sets was not 80:20, and only one data set was used. However, since some papers use a 90:10 ratio [22,23] and others use various ratios [1,6,24], we do not consider this a problem; likewise, there are papers that validate with only one data set [3,4,25]. Fourth, only one generalization capability test was used in this study. Classifying small lesions would make it possible to verify the ability to respond to minute changes and the clinical practicality. Because the accuracy in actual diagnosis was not verified, the classification of lesions and malignancy needs to be verified before clinical use. However, being able to train with the original pixel size without resizing is useful, because it may capture more minute features. Finally, we tested the generalization capability only on the model with the highest overall accuracy. Testing other models could reveal the relationship between accuracy and parameters.
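As a reference for the validation scheme discussed above, a stratified holdout split reserves a fixed fraction of each class for testing so the class balance is preserved. A minimal sketch (the 80:20 ratio is the conventional one cited in the text; the file names and labels are placeholders, not the study's data):

```python
import random

def stratified_holdout(items_by_class, test_fraction=0.2, seed=42):
    """Split each class separately so the test set keeps the class balance."""
    rng = random.Random(seed)
    train, test = [], []
    for label, items in items_by_class.items():
        shuffled = items[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test += [(x, label) for x in shuffled[:n_test]]
        train += [(x, label) for x in shuffled[n_test:]]
    return train, test

# Placeholder file names standing in for CT slices of two classes.
data = {
    "with_pancreas": [f"p_{i:04d}.dcm" for i in range(100)],
    "without_pancreas": [f"n_{i:04d}.dcm" for i in range(100)],
}
train, test = stratified_holdout(data)
print(len(train), len(test))  # 160 40
```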
The CNN we created in this study can quickly create a model, even if it is trained with a large number of medical images. This feature has the potential to create an image classification model that can be updated daily by training with images taken on the same day. As a result, the created CNN can be optimized for the imaging method, rules, radiologist habits, and patient tendency for each facility and can contribute to the creation of a diagnostic support application specialized for each facility.

Conclusions
By optimizing the filter size of the convolution layer and max-pooling of 512 × 512 images, we were able to quickly obtain a highly accurate medical image classification model. This improved CNN can be useful for the classification of lesions and anatomies for related diagnostic aid applications.
Author Contributions: K.M. contributed to data analysis, algorithm construction, and the writing and editing of the article; Y.A. and T.Y. contributed to reviewing and editing the paper; H.S. proposed the idea and contributed to data acquisition, performed supervision, project administration, and reviewed and edited the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Institutional Review Board Statement: The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Hokkaido University Hospital Ethics Committee.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The models created in this study are available on request from the corresponding author. However, the image datasets presented in this study are not publicly available due to ethical reasons, e.g., containing information that could compromise the privacy of research participants.

Conflicts of Interest: The authors declare that no conflicts of interest exist.