Deep Learning–Based Brain Computed Tomography Image Classification with Hyperparameter Optimization through Transfer Learning for Stroke

Brain computed tomography (CT) is commonly used for evaluating the cerebral condition, but immediately and accurately interpreting emergent brain CT images is tedious, even for skilled neuroradiologists. Deep learning networks are commonly employed for medical image analysis because they enable efficient computer-aided diagnosis. This study proposed the use of convolutional neural network (CNN)-based deep learning models for efficient classification of unenhanced brain CT image findings into normal, hemorrhage, infarction, and other categories. The included CNN models were CNN-2, VGG-16, and ResNet-50, all of which were pretrained models adapted through transfer learning with various data sizes, mini-batch sizes, and optimizers. Their performance in classifying unenhanced brain CT images was tested thereafter. This performance was then compared with the outcomes in other studies on deep learning–based hemorrhagic or ischemic stroke diagnoses. The results revealed that among our CNN-2, VGG-16, and ResNet-50 models, analyzed by considering several hyperparameters and environments, the CNN-2 and ResNet-50 outperformed the VGG-16, with an accuracy of 0.9872; however, ResNet-50 required a longer time to present the outcome than did the other networks. Moreover, our models performed much better than those reported previously. In conclusion, after appropriate hyperparameter optimization, our deep learning–based models can be applied to clinical scenarios where neurologists or radiologists may need to verify whether their patients have a hemorrhagic stroke, an infarction, or any other finding.


Introduction
Brain computed tomography (CT) is the modality most commonly used for evaluating the cerebral condition [1]. It is more widely available, faster, and more cost-effective than brain magnetic resonance imaging. Although brain CT was developed in the 1970s, its widespread clinical use became achievable only recently, after the introduction of rapid, large-coverage multidetector-row CT scanners. Key clinical applications for brain CT include the diagnosis of cerebral hemorrhage, ischemia, and neoplasm and the evaluation of the mass effect after hemorrhage, neoplasm, and cerebral edema secondary to ischemia. However, immediate and highly accurate interpretation of emergent CT images remains time-consuming and laborious, even for skilled neuroradiologists [2]. Lodwick was the first to describe computer-aided diagnosis (CAD). Since then, a wide variety of lesion detection systems have been reported [3,4]. The usefulness of CAD depends on the number of true- and false-positive markers. High-performance CAD systems are appreciated by radiologists in screening practice [5]. For nodule detection in chest X-ray, CAD can even outperform inexperienced radiologists in diagnostic efficiency [6]. At present, some CAD systems have received approval from the U.S. Food and Drug Administration [7,8]. Compared with traditional CAD (which may be limited to detecting a specific disease and may require a stand-alone workstation), deep learning can handle more complicated conditions. With a relatively wide scope, deep learning delivers multiple answers. The software improvements over the last few decades have enabled not only a considerable amount of research on image processing algorithms and methodologies but also rapid, faultless identification and quantification of abnormalities in scanned regions [9].
Deep learning, which is widely employed in medicine [10][11][12], can outperform humans in diagnosis. Li et al. [13] proposed a U-Net-based model to identify cerebral hemorrhage; it has many advantages over human expertise but demands considerable manpower and time for segmentation. We aimed to develop a simple model, such as a CNN-based system, that classifies brain CT results in the manner of a red-dot system. Training a conventional convolutional neural network (CNN) from scratch typically requires a considerable amount of data. Nevertheless, through transfer learning, a small amount of data can become sufficient for fine-tuning a pretrained model [14]. For patients with preexisting cerebral changes, a final human check of the deep learning-based diagnosis is required to ensure credibility; nevertheless, such a diagnosis can improve the clinical decision-making of neuroradiologists [15]. Owing to the wide variety of commercial models available, understanding the mechanism underlying the "black box" in computer operation is difficult. Moreover, few studies have provided a reference for hyperparameter adjustment of the various models under specific conditions.
In this study, pretrained models including CNN-2, VGG-16, and ResNet-50, with varied data sizes, mini-batch sizes, and optimizers, were compared in terms of their performance in classifying unenhanced brain CT image findings into normal, acute or subacute hemorrhage, acute infarction, and other categories. We also reviewed other studies on deep learning-based diagnoses of hemorrhagic and ischemic stroke and compared their outcomes with ours.

Data Collection
This retrospective study was approved by our institutional review board, which also waived the requirement for patient informed consent for the use of anonymized patient imaging data. Our dataset included 24,769 unenhanced brain CT images from 1715 patients collected between 1 July and 1 October 2019. Of these images, 4382, 6102, 3860, and 2995 were from healthy patients, patients with hemorrhage, patients with infarction, and patients with other findings, respectively. The uneven ratio between the groups also reflects the patient structure in our district. Moreover, only 10% of these images (n = 2476) constituted the testing dataset (Table 1), which was used to evaluate performance objectively, so that the reported accuracy would not merely reflect the data of a specific period. We labelled these images based on clinical results after at least a half-year follow-up and split them randomly into training, validation, and testing datasets at the case level. If the images of a case included abnormal findings, the result was defined as abnormal and further classified as hemorrhage, infarct, or others.
All images were confirmed by two radiologists, who had 1 year and 20 years of experience, respectively, to serve as the ground truth for deep learning. The other findings included craniotomy, deep brain implantation, severe motion, or artifacts.

Preprocessing Steps
The original images from the dataset were denominated and cropped to a standard size of 224 × 224 × 3 using MicroDicom, a DICOM viewer application used here for image processing.

Data Normalization
Data normalization is an important step for the numerical stabilization of a CNN; it changes the range of the pixel intensity values and converts an input image into a range of pixel values that is more familiar or normal to the senses. Normalized inputs also make gradient descent faster and more stable [16]. In the current study, the input images were downsampled to 224 × 224 × 3, and their pixel values were brought into the range of 0-1 for each channel through simple scaling with centering.
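As a minimal sketch of the scaling step described above, the following assumes plain min-max rescaling of each channel; the text does not state the exact formula, and the function name `min_max_scale` and the sample intensities are hypothetical.

```python
def min_max_scale(channel):
    """Rescale one channel's pixel intensities linearly into [0, 1].

    Assumption: simple min-max scaling; the study's exact formula is not stated.
    """
    lo, hi = min(channel), max(channel)
    if hi == lo:
        # A constant channel carries no contrast; map it to all zeros.
        return [0.0 for _ in channel]
    return [(v - lo) / (hi - lo) for v in channel]


# Hypothetical 8-bit intensities from one channel of a CT slice.
pixels = [0, 51, 102, 255]
scaled = min_max_scale(pixels)
```

The smallest intensity maps to 0 and the largest to 1, so every channel reaches the network on the same numeric scale regardless of its original range.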

Data Augmentation
CNNs are data-hungry and can be more powerful and perform better with larger datasets [17]. However, collecting medical data can be difficult. Data augmentation, in which a series of transformation procedures are applied while preserving the original labels, has been widely used to multiply the number of images. Acting as a regularizer, this adjustment can also aid in preventing overfitting. We randomly applied horizontal flipping, rotation, shift, zoom, and shear to each image, so the resulting data size was about 1- to 2-fold that of the original (24,767 × 1 to 24,767 × 2).
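The label-preserving idea behind augmentation can be sketched as follows. This is an illustrative assumption rather than the pipeline used in the study: only horizontal flipping and shifting are shown (rotation, zoom, and shear are omitted for brevity), images are plain nested lists, and the names `augment_dataset` and `p_extra` are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift every row right by `pixels` (0 <= pixels <= width), padding with `fill`."""
    width = len(image[0])
    return [[fill] * pixels + row[: width - pixels] for row in image]


def augment_dataset(samples, rng, p_extra=0.5):
    """Return the originals plus, for a random subset, one transformed copy.

    Each augmented copy keeps the label of its source image, so the result
    holds between 1x and 2x the original number of samples.
    """
    out = list(samples)
    for image, label in samples:
        if rng.random() < p_extra:
            new = image
            if rng.random() < 0.5:
                new = horizontal_flip(new)
            new = shift_right(new, rng.randint(0, 1))
            out.append((new, label))
    return out


rng = random.Random(0)  # fixed seed so the sketch is reproducible
toy = [([[1, 2, 3], [4, 5, 6]], "hemorrhage"), ([[7, 8, 9], [1, 1, 1]], "normal")]
augmented = augment_dataset(toy, rng)
```

Because at most one extra copy is generated per source image, the augmented set is bounded between the original size and twice that size, which mirrors the 1- to 2-fold range quoted above.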

Transfer Learning
Deep learning is data-dependent, which is the most challenging problem of this modality: the network requires sufficient training with a large amount of information to recognize data patterns. Ideally, both training and testing data are assumed to have the same distribution and features. However, constructing sufficiently large datasets can be laborious, time-consuming, and even impossible in some cases. In such cases, transfer learning from one pretrained domain to another is beneficial and efficient [14,18].

Hyperparameters for CNN-Based Models
In this study, the included CNN models were the pretrained CNN-2, VGG-16, and ResNet-50 from ImageNet [19,20] (Table 2). They were tuned with the following hyperparameters: learning rate = 0.0001, max epochs = 50, and mini-batch sizes = 8, 16, 32, 64, and 128. Early stopping was applied if exploding gradients occurred. Dropout rates of 0.3 and 0.5 were used in all included models; ResNet-50 was additionally run without dropout for further comparison. The optimizers compared were Adam [21], SGD [22], and RMSProp [23]. We applied a rectified linear unit (ReLU) activation function [24], which converts the weighted sum of a node's inputs into the node's output, in each convolutional layer [25]; that is, ReLU was implemented in the hidden layers of the CNN. We also divided our dataset sizes into four categories (<1000, 1000-5000, 5000-9000, and >10,000) and compared them for each model under its best-performing environment.
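The search space described above can be written down as a small configuration grid. The sketch below is an assumption about how the settings combine: it fixes one optimizer per grid and crosses mini-batch size with dropout rate for each model, and the names `DROPOUT_RATES` and `build_grid` are hypothetical.

```python
from itertools import product

LEARNING_RATE = 1e-4
MAX_EPOCHS = 50
MINI_BATCH_SIZES = [8, 16, 32, 64, 128]
OPTIMIZERS = ["Adam", "SGD", "RMSProp"]

# Dropout settings per model; None means "no dropout" (ResNet-50 only).
DROPOUT_RATES = {
    "CNN-2": [0.3, 0.5],
    "VGG-16": [0.3, 0.5],
    "ResNet-50": [None, 0.3, 0.5],
}


def build_grid(optimizer="Adam"):
    """List one configuration dict per (model, mini-batch size, dropout) combination."""
    if optimizer not in OPTIMIZERS:
        raise ValueError(f"unknown optimizer: {optimizer}")
    grid = []
    for model, rates in DROPOUT_RATES.items():
        for batch_size, dropout in product(MINI_BATCH_SIZES, rates):
            grid.append(
                {
                    "model": model,
                    "optimizer": optimizer,
                    "learning_rate": LEARNING_RATE,
                    "max_epochs": MAX_EPOCHS,
                    "mini_batch_size": batch_size,
                    "dropout": dropout,
                }
            )
    return grid


grid = build_grid()
```

Under these assumptions, each optimizer yields 35 configurations (10 each for CNN-2 and VGG-16, and 15 for ResNet-50), which makes explicit how quickly the number of training runs grows with each added hyperparameter.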

Performance Evaluation
We evaluated the proposed CNN and the other pretrained models in terms of accuracy, precision, recall, and F1-score, and plotted curves for calculating the areas under the curves (AUCs), with epochs on the X axis and accuracy on the Y axis.
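For a four-class problem, these metrics follow directly from a confusion matrix. The sketch below uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of the two); the counts in `toy_cm` are made up for illustration and are not the study's results, and `per_class_metrics` is a hypothetical helper name.

```python
def per_class_metrics(cm):
    """Compute overall accuracy and per-class (precision, recall, F1).

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    report = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum = TP + FP
        actual_k = sum(cm[k])  # row sum = TP + FN
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report.append((precision, recall, f1))
    return accuracy, report


CLASSES = ["normal", "hemorrhage", "infarction", "other"]
# Made-up counts for illustration only (rows: true class, columns: predicted).
toy_cm = [
    [48, 1, 1, 0],
    [2, 47, 1, 0],
    [1, 2, 46, 1],
    [0, 0, 1, 49],
]
accuracy, report = per_class_metrics(toy_cm)
for name, (p, r, f1) in zip(CLASSES, report):
    print(f"{name:<11} precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Reporting the per-class values alongside accuracy matters here because the class sizes are uneven, so a high overall accuracy alone could hide weak recall on a smaller class.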

Training Performance
The training performance data, including training loss, validation loss, and validation accuracy, obtained by the selected models at different epochs are listed in Table 4 and illustrated in Figure 4. Moreover, Figure 5 presents the confusion matrices for all the models with testing data.

Comparison of Different Optimization Methods
The Adam optimizer was used for training all the models. To evaluate the classification effectiveness of this optimization method, the results were compared with those of other efficient optimization methods, such as SGD and RMSProp, as shown in Table 6. The Adam optimizer was found to provide the most efficient results.

Comparison of Different Data Sizes for Training
Training data sizes of <1000, 1000-5000, 5000-9000, and >10,000, respectively, provided validation accuracies of 0.3312, 0.6111, 0.8655, and 0.9872 in the CNN-2; 0.5250, 0.7075, 0.8884, and 0.9575 in the VGG-16; and 0.5750, 0.7266, 0.9567, and 0.9872 in the ResNet-50 (Table 7). In general, the accuracy of all the included models increased with data size, with the highest accuracy obtained for data sizes of >10,000. Nevertheless, the accuracy values of the ResNet-50 for data sizes of 5000-9000 and >10,000 were very close, indicating that high accuracy (≥0.9567) can be achieved by using ResNet-50 even for a data size >5000. This is in contrast to the results for the CNN-2 and VGG-16, which could not achieve comparably high accuracy for a data size <10,000.
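The size of each step can be checked with a few lines of arithmetic on the accuracies quoted above. The values are copied from the text; the helper `gains` and the dictionary names are hypothetical.

```python
# Validation accuracies as reported in the text for each training-size bin.
SIZE_BINS = ["<1000", "1000-5000", "5000-9000", ">10,000"]
VAL_ACCURACY = {
    "CNN-2": [0.3312, 0.6111, 0.8655, 0.9872],
    "VGG-16": [0.5250, 0.7075, 0.8884, 0.9575],
    "ResNet-50": [0.5750, 0.7266, 0.9567, 0.9872],
}


def gains(accs):
    """Accuracy gained when moving from each size bin to the next larger one."""
    return [round(b - a, 4) for a, b in zip(accs, accs[1:])]


# Gain from the 5000-9000 bin to the >10,000 bin for each model.
last_step_gain = {model: gains(accs)[-1] for model, accs in VAL_ACCURACY.items()}
```

The final step adds about 0.03 for ResNet-50 but about 0.12 for CNN-2 and 0.07 for VGG-16, which is the numerical basis for saying that ResNet-50 is already close to its best accuracy in the 5000-9000 bin.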

Discussion
On the basis of the current results, CNN-based deep learning models can be used to detect strokes automatically and with high accuracy after hyperparameter optimization. We also compared the performance with different hyperparameters, regularizers, and data sizes.
Hemorrhagic stroke, a common and fatal disease, often presents symptoms similar to those of the more commonly diagnosed ischemic stroke. However, the treatment for hemorrhagic stroke is focused on controlling the bleeding from ruptured cerebral vasculature or aneurysms, whereas that for ischemic stroke is focused on recanalizing clot blockages in cerebral arteries. Misdiagnosis leading to the erroneous use of anticoagulant agents for treating a hemorrhagic stroke can cause death. Unenhanced brain CT is the most common and recommended test of choice to distinguish the two stroke types. Nevertheless, diagnosing a subtle infarct based only on unenhanced CT is more difficult. Schriger et al. [26] claimed that in the absence of support from a radiologist, the accuracy of this interpretation is only 0.67 among emergency physicians treating patients with stroke.
Since the advent of deep learning, the use of brain CT images for accurate prediction of critical anomalies has received considerable attention. Several attempts have thus been made to develop a reliable diagnosis model using deep learning methods. Transfer learning has also been extensively used with the recent CNN-based networks [27]. However, most of these methods have employed an imbalanced and limited amount of data, which has led to unsatisfactory results. In the current study, we developed a system that classifies hemorrhagic and ischemic strokes by using numerous brain CT images sampled uniformly from a patient population.
Here, we comprehensively evaluated the effectiveness of the three most efficient CNN models, namely CNN-2, VGG-16, and ResNet-50, in the classification of hemorrhagic and ischemic strokes from brain CT images after hyperparameter optimization. One of the most crucial hyperparameters considered in this study was the mini-batch size. We accordingly identified the mini-batch size that provided the highest validation accuracy for the CNN-2, VGG-16, and ResNet-50 with regard to classifying the brain CT images. Moreover, the best performing models in this study were found to be CNN-2 and ResNet-50 (highest accuracy = 0.9872). Grewal et al. [28] developed an automatic intracranial hemorrhage detection model based on deep learning, with a sensitivity of 0.8864 and a precision of 0.8124 in a dataset of 77 brain CT images interpreted by three radiologists. However, the authors included a small dataset and detected only hemorrhagic stroke in their analysis. Moreover, Prevedello et al. [29] assessed the performance of a deep learning algorithm to detect hemorrhage, mass effect, hydrocephalus, and suspected acute infarction by using a dataset of 50 brain CT images and reported AUCs of 0.91 for hemorrhage, mass effect, and hydrocephalus and only 0.81 for suspected acute infarction. In the current study, after optimization, all three models, trained with relatively more data, demonstrated outstanding performance, with F1 scores >0.95.
In addition to accuracy, efficiency is an important factor in medical image classification. In this study, the VGG-16 and CNN-2 required only about 2 and 8 min on average to provide the outcome, respectively, which is nearly 14 and 4 times faster than the time taken by the ResNet-50, respectively. We believe that this significant difference in time consumption occurs because of the relatively complicated structure of ResNet-50, with numerous hidden layers. With a larger data size, the time cost would be several-fold higher than ours, and the difference in time consumption between the models would be larger.
The false-positive images are illustrated in Figure 6. However, because of the mechanism underlying the "black box", we cannot provide a clear explanation of why the classification failed. We found no relationship between the failures and lesion size, laterality, location, or the augmentation process. It is possible that an increase in data size will achieve better performance.
Tandel et al. [30] reported that the simplest CNN-1 could classify benign and malignant gliomas with high efficiency from magnetic resonance images, with further high-efficiency subclassification into low- and high-grade malignant gliomas enabled by the CNN-2. Further segmentation of low-grade and high-grade malignant gliomas can be performed using the CNN-3 and CNN-4. They utilized artificial neural networks (ANNs) as feature extraction algorithms and a CNN as the classifier, with a high accuracy of 0.98. Despite its simple architecture, a CNN is effective in classifying gliomas from magnetic resonance images; hence, we considered it efficient for classifying the four categories in our study.
Ioffe and Szegedy [31] claimed that removing dropout, a regularizer, from ResNet allows the network to achieve increased accuracy. We noted similar results for our brain CT images (Table 3), even across different mini-batch sizes.
The activation function is key in deep learning architectures, and many types of nonlinear activation functions exist. Pedamonti [24] reported that both ReLU and LeakyReLU are suitable activation functions for CNN-based models, particularly deeper neural networks. In the current study, we also compared different activation functions on VGG-16 for stroke classification, and no apparent difference between ReLU and LeakyReLU was noted; nevertheless, their performance was better than that of the sigmoid activation function. However, considering the wide range of activation functions available, including ReLU, LeakyReLU, ELU, SELU, and sigmoid, the activation function most suitable for classifying stroke images warrants further investigation.
The main clinical application of our result is as a classifier for radiologists, who can then quickly issue a warning to clinicians. Rather than stopping at a "red-dot system", we will apply this result to further develop a system combined with natural language processing (NLP). Although understanding the mechanism underlying the "black box" in computer operation is difficult, we can supply many image inputs for NLP training through this classifier. We are committed to doing this in the future.

Limitations
Although deep learning has considerable potential in medical applications, some related limitations include data availability and variability. The contrast, noise, and resolution levels used for CT vary between institutions, and this can impede adaptation for deep learning. Moreover, data privacy is essential when considering the use of medical images for research and development, and this limits the amount of data available. Data generalization is achievable through transfer learning and data augmentation (both of which can produce additional features to learn), such that any related problem may one day be resolved; however, this may take a long time and a considerable amount of available data to accomplish.

Conclusions
In this study, the use of CNN-based deep learning was proposed for efficient classification of hemorrhagic and ischemic stroke using unenhanced brain CT images. The CNN models CNN-2, VGG-16, and ResNet-50, pretrained through transfer learning, were analyzed by considering several hyperparameters and environments, and their results were compared. CNN-2 and ResNet-50 outperformed the VGG-16 with an accuracy of 0.9872; however, ResNet-50 required a longer time than the other networks. After optimization, the tested models may be applied by radiologists to verify their screening results and thus reduce their workload. Our results also pave the way for further development of effective deep CNN models (using residual connections) for increasing the diagnostic accuracy for stroke. In the future, we will verify the effectiveness of our proposed models in terms of time required and performance and explore the use of optimization algorithms along with the models used in this study to design a more reliable model.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study prior to CT acquisition.

Data Availability Statement: Not applicable.