Intracranial Hemorrhage Segmentation Using Deep Convolutional Model

Traumatic brain injury can cause intracranial hemorrhage (ICH), which may lead to disability or death if it is not diagnosed and treated promptly. The current clinical protocol for diagnosing ICH is to have radiologists examine Computerized Tomography (CT) scans to detect ICH and localize its regions. However, this process relies heavily on the availability of an experienced radiologist. In this paper, we designed a study protocol to collect a dataset of 82 CT scans of subjects with traumatic brain injury. The ICH regions were then manually delineated in each slice by a consensus decision of two radiologists. Recently, fully convolutional networks (FCN) have been shown to be successful in medical image segmentation. We developed a deep FCN, called U-Net, to segment the ICH regions from the CT scans in a fully automated manner. The method achieved a Dice coefficient of 0.31 for the ICH segmentation based on 5-fold cross-validation. The dataset is publicly available online in the PhysioNet repository for future analysis and comparison.


Introduction
Traumatic brain injury (TBI) is a major cause of death and disability in the United States, contributing to about 30% of all injury deaths in 2013 [1]. After accidents involving TBI, extra-axial intracranial lesions, such as intracranial hemorrhage (ICH), may occur. ICH is a critical medical condition that results in a high rate of mortality [2]. It is considered clinically dangerous because of its high risk of turning into a secondary brain insult that may lead to paralysis and even death if it is not treated promptly. Depending on its location in the brain, ICH is divided into five sub-types: Intraventricular (IVH), Intraparenchymal (IPH), Subarachnoid (SAH), Epidural (EDH), and Subdural (SDH). In addition, ICH that occurs within the brain tissue is called intracerebral hemorrhage.
The Computerized Tomography (CT) scan is commonly used in the emergency evaluation of subjects with TBI for ICH [3]. The availability of the CT scan and its rapid acquisition time make it a preferred diagnostic tool over Magnetic Resonance Imaging for the initial assessment of ICH. CT scans generate a sequence of images using X-ray beams, where brain tissues are captured with different intensities depending on the amount of X-ray absorbency (Hounsfield units (HU)) of the tissue. CT scans are displayed using a windowing method, which transforms the HU numbers into grayscale values ([0, 255]) according to the window level and width parameters. By selecting different window parameters, different features of the brain tissues are displayed in the grayscale image (e.g., the brain window, stroke window, and bone window) [4]. In CT scan images based on the brain window, the ICH regions appear as hyperdense regions with a relatively undefined structure. These CT images are examined by an expert radiologist to determine whether an ICH has occurred and, if so, to detect its type and region. However, this diagnosis process relies on the availability of a subspecialty-trained neuroradiologist and, as a result, could be time inefficient and even inaccurate, especially in remote areas where specialized care is scarce.
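The level/width windowing transform described above can be sketched as follows. This is a minimal illustration: the function name and the clipping convention are our own; only the [0, 255] target range and the window level/width parameterization come from the text.

```python
import numpy as np

def window_ct(hu_image, level=40, width=120):
    """Map Hounsfield units to [0, 255] grayscale given a window level and width.

    level=40, width=120 corresponds to a typical brain window; other settings
    (e.g., a bone window) use the same transform with different parameters.
    """
    lower = level - width / 2.0
    upper = level + width / 2.0
    clipped = np.clip(hu_image, lower, upper)      # saturate values outside the window
    scaled = (clipped - lower) / (upper - lower) * 255.0
    return scaled.astype(np.uint8)
```

With this transform, all tissue below the window floor maps to black and all tissue above the ceiling maps to white, so each window setting highlights a different intensity band of the brain.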
Recent advances in convolutional neural networks (CNN) have demonstrated excellent performance in automating multiple image classification and segmentation tasks [5]. Hence, we hypothesized that deep learning algorithms have the potential to automate the diagnosis procedure for segmenting the ICH regions. We developed a fully convolutional network (FCN) to segment the ICH regions in each CT slice. Such a method could significantly reduce the time and error in the ICH diagnosis where expert radiologists are not readily available. An automated ICH screening tool can be used to assist junior radiology trainees in detecting ICH and its sub-types, or when experts are not immediately available in emergency rooms, especially in developing countries or remote areas.
Furthermore, there is only one publicly available dataset, called CQ500, for the detection of ICH sub-types [6]; it consists of 491 head CT scans. There is no publicly available dataset for ICH segmentation. Hence, there is a need for a benchmark dataset that could help extend the work on ICH segmentation. Therefore, the other focus of this work was collecting head CT scans with ICH segmentation and making them publicly available. We also performed a comprehensive literature review in the area of ICH detection and segmentation.

Intracranial Hemorrhage Detection
Several traditional machine learning and deep learning approaches have been developed in the literature. Regarding the traditional machine learning methods, Yuh and colleagues developed a threshold-based algorithm to detect ICH; the method then detected the ICH sub-types based on their location, shape, and volume [8]. The authors optimized the value of the threshold using retrospective samples of 33 CT scans and evaluated their model on 210 CT scans of subjects with suspected TBI. Their algorithm achieved 98% sensitivity and 59% specificity for the ICH detection and an intermediate accuracy in detecting the ICH sub-types. In another work, Li and colleagues proposed two methods to segment the SAH space and then used the segmented regions to detect SAH hemorrhage [9,10]. One method used elastic registration with an SAH space atlas, whereas the other extracted distance transform features and trained a Bayesian decision method to perform the delineation. After the SAH space segmentation, the mean gray value, variance, entropy, and energy were extracted and used to train a support vector machine classifier for the SAH hemorrhage detection. They used 60 CT scans (30 with SAH hemorrhage) to train the algorithm and tested the model on 69 CT scans (30 with SAH hemorrhage). The best performance was reported using the Bayesian decision method, with 100% testing sensitivity, 92% specificity, and 91% accuracy [10].
Regarding the deep learning approaches, all the methods were based on CNN and its variants, except for the approaches in Refs. [19,23,25], which were based on an FCN model. In some of these approaches, the spatial dependency between adjacent slices was considered using a second model, such as a random forest [6] or an RNN [13,18]. Some authors also modified the CNN to process part of or the entire CT scan [14,15] or used an interpolation layer [17]. Other approaches were 1-stage, meaning that they did not consider the spatial dependency between the slices [12,19,23]. Prevedello and colleagues proposed two algorithms based on CNN [12]. One of their algorithms focused on detecting ICH, mass effect, and hydrocephalus in the CT scans, while the other was developed to detect suspected acute infarcts. A total of 246 CT scans were used for training and validation (100 hydrocephalus, 22 suspected acute infarct, and 124 noncritical findings), and a total of 100 CT scans were used for testing (50 hydrocephalus, 15 suspected acute infarct, and 35 noncritical findings). The testing predictions were validated against the final radiology report, or against the neuroradiologist's review for equivocal findings. The hydrocephalus detection algorithm yielded 90% sensitivity, 85% specificity, and an area under the curve (AUC) of 0.91. The suspected acute infarct detection algorithm resulted in a lower specificity and an AUC of 0.81.
Chilamkurthy and colleagues proposed four algorithms to detect the sub-types of ICH, calvarial fractures, midline shift, and mass effect [6]. They trained and validated the algorithms on a large dataset with 290k and 21k CT scans, respectively. Two datasets were used for testing. As part of the testing, a dataset with 491 scans was made public (called CQ500). Clinical radiology reports were used as the gold standard to label the training and validation CT scans; each scan was labeled from these reports using a natural language processing algorithm. The testing scans were then annotated by the majority vote of the ICH sub-types reported by three expert radiologists. Different deep models were developed for each of the four categories. A ResNet18 was trained with five parallel fully connected layers as the output layers. The results of these output layers for each slice were fed to a random forest algorithm to predict the scan-level confidence for the presence of an ICH. They reported an average AUC of 0.93 for the ICH sub-type detection on both datasets. At the high-sensitivity operating point, the average sensitivity was 92%, which was similar to that of the radiologists. However, the average specificity was 70%, which was significantly lower than the gold standard, and it varied across the ICH sub-types; the lowest specificity of 68% was for the SDH detection.
Two approaches based on CNN with RNN were proposed to detect ICH [13,18]. Grewal et al. [13] proposed a 40-layer CNN, called DenseNet, with a bidirectional long short-term memory (LSTM) layer for the ICH detection. They also introduced three auxiliary tasks after each dense convolutional block to compute the binary segmentation of the ICH regions. Each of these tasks consisted of one convolutional layer followed by a deconvolution layer to upsample the feature maps to the original image size. The LSTM layer was added to incorporate the inter-slice dependencies of the CT scans of each subject. They considered 185 CT scans for training, 67 for validation, and 77 for testing. The training data were augmented by rotation and horizontal flipping to balance the number of scans in each of the two classes. The network's detection on the test data was evaluated against the annotations of three expert radiologists for each CT slice. They reported 81% accuracy, 88% recall (sensitivity), 81% precision, and an 84% F1 score. The model's F1 score was higher than that of two of the three radiologists. Also, adding attention layers provided a significant increase in the model sensitivity. In [18], the authors presented a 3D joint convolutional and recurrent neural network (CNN-RNN) to detect and classify ICH regions. The overall architecture of this model was similar to the model proposed by Grewal et al. [13]. VGG-16 was used as the CNN model, and a bidirectional Gated Recurrent Unit (GRU) was used as the RNN model. The RNN layer had the same functionality as the slice interpolation technique proposed in [17], but it was more flexible with respect to the number of adjacent slices included in the classification. The algorithm was trained and validated on 2,537 CT scans and tested on 299 CT scans. They reported a precise slice-level ICH detection with 99% for both sensitivity and specificity and an AUC of 1.
However, for the classification of the ICH sub-types, they reported a lower performance, with 80% average sensitivity, 93.2% average specificity, and an AUC of 0.93. The lowest sensitivity, 69%, was reported for both SAH and EDH.
In three approaches, the CNN model was modified to process a number of CT slices at once [14-16]. Jnawalia and colleagues [14] proposed an ensemble of three different CNN models to perform the ICH detection. The CNN models were based on the architectures of AlexNet and GoogleNet, extended to 3D models by taking all the slices of each CT scan. Their models also had fewer parameters owing to a reduced number of layers and smaller filter specifications. They trained, validated, and tested their model on a large dataset with 40k CT scans. About 34k CT scans were used for training (26k normal scans); however, the method used to label the CT scans was not reported. The positive slices were oversampled and augmented to make a balanced training dataset. About 2k and 4k scans were used for validation and testing, respectively. The AUC of the ensemble of the CNN models was 0.87, with a precision of 80%, recall of 77%, and F1-score of 78%. Chang and colleagues also developed a deep learning algorithm to detect ICH and its sub-types (except for IVH), with the ability to segment the ICH regions and quantify the ICH volume [15]. Their deep model is based on a region-of-interest CNN that estimates the regions containing an ICH for every five CT slices and then generates a segmentation mask for the positive cases of ICH. The authors trained their algorithm on a dataset with 10k CT scans and tested it on a prospective dataset of 862 CT scans. They reported 95% sensitivity, 97% specificity, and an AUC of 0.97 for the classification of ICH sub-types, and an average Dice score of 0.85 for the ICH segmentation. The lowest detection sensitivity of 90% and Dice score of 0.77 were reported for SAH. In [16], an ensemble of four 3D CNN models with an input shape of 24 × 256 × 256 was implemented and evaluated using 9,499 retrospective and 347 prospective CT scans.
An AUC of 0.846 was achieved on the retrospective study, and an average sensitivity of 71.5% and specificity of 83.5% were obtained on both testing datasets.
Similar to the work of Jnawalia et al. [14], Lee and colleagues used transfer learning on an ensemble of four well-known CNN models to detect the ICH sub-types and bleeding points [17]. The four models were VGG-16, ResNet-50, Inception-v3, and Inception-ResNet-v2. The spatial dependency between adjacent slices was taken into consideration by introducing a slice interpolation technique. This ensemble model was trained and validated using a dataset with 904 CT scans and tested using a retrospective dataset with 200 CT scans and a prospective dataset with 237 scans. On average, the ICH detection algorithm resulted in a testing AUC of 0.98 with 95% sensitivity and specificity. However, the algorithm resulted in a significantly lower sensitivity for the classification of the ICH sub-types, with 78.3% sensitivity and 92.9% specificity. The lowest sensitivity of 58.3% was reported for the EDH slices in the retrospective test set and 68.8% for the IPH slices in the prospective test set. The overall localization accuracy between the model's attention maps and the radiologists' maps of bleeding points was 78.1%.
The traditional methods usually require preprocessing of the CT scans to remove the skull and noise. They also require registering the segmented brains and extracting complicated engineered features. Many of these methods are based on unsupervised clustering to segment the ICH regions [11,20,21,26]. The methods in Refs. [20] and [11] both use the Distance Regularized Level Set Evolution (DRLSE) method to fit active contours to ICH regions. Prakash and colleagues modified DRLSE for the segmentation of the IVH and IPH regions after preprocessing the CT scans for skull removal and noise filtering [20]. Validating the method on 50 test CT scans resulted in an average Dice coefficient of 0.88, 79.6% sensitivity, and 99.9% specificity. Shahangian and colleagues used DRLSE for the segmentation of the EDH, IPH, and SDH regions and also proposed a supervised method based on a support vector machine for the classification of the ICH slices [11]. The first step in their method was segmenting the brain by removing the skull and brain ventricles. Next, they performed the ICH segmentation based on DRLSE. Then, they extracted the shape and texture features of the ICH regions, and finally, they performed the ICH detection. This method resulted in an average Dice coefficient of 0.585, 82.5% sensitivity, and 90.5% specificity on 627 CT slices. The other traditional unsupervised studies [21,26] used a fuzzy c-means clustering approach for the ICH segmentation. The authors in Ref. [21] proposed a method based on spatial fuzzy c-means clustering and a region-based active contour model. A retrospective set of 20 CT scans with an ICH was used. The authors reported 79% sensitivity, 99% specificity, and an average Jaccard index of 0.78. Similarly, Gautam and colleagues proposed a method based on white matter fuzzy c-means clustering followed by a wavelet-based thresholding technique [26]. They evaluated their method on 20 CT scans with an ICH and reported a Dice coefficient of 0.82.
Unlike the unsupervised methods, the traditional supervised approaches [7,22] use labeled slices to train the classifiers. The authors in [7] proposed a semi-automatic ICH segmentation method where the brain in each CT slice was first segmented and aligned. Then, the candidate ICH regions were selected using a top-hat transformation and the extraction of the asymmetrical high-intensity regions. Finally, the candidate regions were fed to a knowledge-based classifier for the ICH detection. This method resulted in 100% slice-level sensitivity, 84.1% slice-level specificity, and 82.6% lesion-level sensitivity. In another work, Muschelli and colleagues [22] proposed a fully automatic method and compared multiple traditional supervised methods for the segmentation of intracerebral hemorrhage. For this purpose, the brains were first extracted from the CT scans and registered using a CT brain-extracted template. Next, multiple features were extracted from each scan. The features consisted of threshold-based information on the CT voxel intensity, local moment information (such as mean and standard deviation), within-plane standard scores, an initial segmentation from an unsupervised model, contralateral difference images, the distance to the brain center, and the standardized-to-template intensity, which contrasts a given CT scan with an average of CT scans from healthy subjects. The classification models considered in this study were logistic regression, a generalized additive model, and random forest. These models were trained on 10 CT scans and tested on 102 CT scans. Random forest resulted in the highest Dice coefficient of 0.899.
The deep learning approaches for the ICH segmentation were based either on CNN [15,17,24] or on the FCN design [19,23,25]. In the previous section, two CNN-based methods for the ICH segmentation were reviewed [15,17]. Another work was developed by Nag and colleagues, where the authors first selected the CT slices with an ICH using a trained autoencoder and then segmented the ICH areas using the Chan-Vese active contour model [24]. A dataset with 48 CT scans was used to evaluate the method; the autoencoder was trained on half of the data, and all the data were used to test the algorithm. This work reported a sensitivity of 71%, a positive predictive value of 73%, and a Jaccard index of 0.55.
FCN provides the ability to predict the presence of ICH at the pixel level, which can be used directly for the ICH segmentation. Several FCN architectures were used for the ICH segmentation: a dilated residual net (DRN) [23], a modified VGG16 [19], and U-Net [25]. The authors in Ref. [23] proposed a cost-sensitive active learning system. The system consisted of an ensemble of patch-based fully convolutional networks (PatchFCN). After the PatchFCN, an uncertainty score was calculated for each patch, and the sum of these patch scores was maximized under an estimated labeling-time constraint. The authors used 934 CT scans for training and validation, and 313 retrospective scans and 100 prospective scans for testing. They reported 92.85% average precision for the scan-level ICH detection using both test sets and 77.9% average precision for the segmentation. The application of the cost-sensitive active learning technique improved the model performance on the prospective test set by annotating new CT scans and increasing the size of the training data. In [19], a CNN cascade model was used for the ICH detection, and dual FCN models were used for the ICH segmentation. The CNN cascade model was based on the GoogLeNet network, and the dual FCN model was based on a pre-trained VGG16 network that was modified and fine-tuned on the brain and stroke window settings. The methods were evaluated using 5-fold cross-validation on about 6k CT scans. The authors reported a sensitivity and specificity of about 98% for the ICH binary classification and an accuracy ranging from 70% to 90% for the ICH sub-type detection; the lowest accuracy was reported for the EDH detection. For the ICH segmentation, they reported 80.19% precision and 82.15% recall. Kuang and colleagues proposed a semi-automatic method to segment the regions of intracerebral hemorrhage in addition to ischemic infarct segmentation [25].
The method consisted of U-Net models for the ICH and infarct segmentation, whose output was fed, together with a user initialization of the ICH and infarct regions, to a multi-region contour evolution. A set of hand-crafted features based on the bilateral density difference between the symmetric brain regions in the CT scan was introduced into the U-Net. Also, the authors weighted the U-Net cross-entropy loss by the Euclidean distance between a given pixel and the boundaries of the true masks. The proposed semi-automatic method with the weighted loss outperformed the traditional U-Net, achieving a Dice similarity coefficient of 0.72. Tables 1 and 2 summarize the methods for the ICH detection and segmentation. As expected, high testing sensitivity and specificity were reported on large datasets, and the performance of the ICH detection algorithms was equivalent to the results from senior expert radiologists [6,15,17,18]. However, the sensitivity of the detection of some ICH sub-types was only equivalent to the results from junior radiology trainees [18]. SAH and EDH were the most difficult ICH sub-types to classify for all the machine learning models [15,17,18,25]. It is interesting to note that SAH is also reported to be the most misclassified sub-type by radiology residents [28]. For the ICH segmentation, the machine learning methods achieved a relatively high performance [15,19-23,25,26]. However, there is still a need for a method that can precisely delineate the regions of all ICH sub-types. To address this need, we first collected a dataset of CT scans, which is available online at https://physionet.org/content/ct-ich/1.2.0/. Then, we implemented a fully convolutional network, known as U-Net, for the ICH segmentation.

Dataset
A retrospective study was designed to collect head CT scans of subjects with TBI. The study was approved by the research and ethics board of the Iraqi Ministry of Health, Babil Office. The CT scans were collected between February and August 2018 at Al Hilla Teaching Hospital, Iraq. The CT scanner was a Siemens SOMATOM Definition AS, which had an isotropic resolution of 0.33 mm, 100 kV, and a slice thickness of 5 mm. The information of each subject was anonymized. A total of 82 subjects (46 male) with an average age of 27.8±19.5 years were included in this study (refer to Table 3 for the subject demographics). Each CT scan includes about 30 slices. Two radiologists annotated the non-contrast CT scans and recorded the ICH sub-types if an ICH was diagnosed. The two radiologists reviewed the non-contrast CT scans together at the same time; once they reached a consensus on the ICH diagnosis, covering the presence of the ICH and its shape and location, the delineation of the ICH regions was performed, which reduced the effort and time spent on the ICH segmentation process. The radiologists did not have access to the clinical history of the subjects.
During the data collection process, Syngo by Siemens Medical Solutions was first used to read the CT DICOM files and save two videos (AVI format), one using the brain window (level=40, width=120) and one using the bone window (level=700, width=3200). Second, a custom tool was implemented in Matlab and used to perform the following tasks: reading the AVI files, switching between the two window level settings, navigating between the slices, recording the radiologist annotations, delineating the ICH regions, and saving them as binary 650x650 masks (JPG format). The grayscale 650x650 images (JPG format) for each CT slice were also saved for both brain and bone windows (please refer to the supplement document for more details about the data collection process). Out of all the 82 subjects, 36 cases were diagnosed with an ICH of the following types: IVH, IPH, SAH, EDH, and SDH. See Figure 2 for some examples. One of the cases had a chronic ICH and was excluded from this study. Table 4 shows the number of slices with and without an ICH as well as the numbers with different ICH sub-types. It is important to note that the number of CT slices for each ICH sub-type in this dataset is not balanced, as the majority of the CT slices do not have an ICH. Besides, IVH was diagnosed in only five subjects and SDH in only four subjects. Also, some slices were annotated with two or more ICH sub-types. The dataset is released in JPG and NIfTI formats at PhysioNet (https://physionet.org/content/ct-ich/1.2.0/), a repository of freely available medical research data. The license is Creative Commons Attribution 4.0 International Public License.

Table 1. Review of the methods proposed for the ICH detection and segmentation. Some papers used retrospective and prospective sets to test their models (i.e., retrospective + prospective), so the reported results are the average of both sets.

ICH Segmentation Using U-Net
Fully Convolutional Network (FCN) is an end-to-end, or 1-stage, algorithm used for semantic segmentation. Recently, FCN has exceeded the state-of-the-art performance in many applications involving delineation of objects. For biomedical image segmentation, U-Net, a type of FCN, was shown to be effective on small training datasets [29], which motivated us to use it for the ICH segmentation in our study. In this work, we investigated the first application of U-Net for the ICH segmentation. The architecture of U-Net is illustrated in Figure 3.

Figure 3. The architecture of U-Net proposed in this study. Each CT slice is divided into 16 windows before feeding them to the U-Net for the ICH segmentation.
The architecture is symmetrical because it is built from two paths: a contracting path and an expansive path. In the contracting path, four blocks of the typical components of a convolutional network are used. Each block consists of two 3 × 3 convolutional filtering layers with padding, each followed by a rectified linear unit (ReLU), and then a 2 × 2 max-pooling layer. In the expansive path, four blocks are also built, each consisting of two 3 × 3 convolutional filtering layers followed by ReLU layers. Each block is preceded by upsampling the feature maps followed by a 2 × 2 convolution (up-convolution), whose output is then concatenated with the corresponding cropped feature map from the contracting path. The skip connections between the two paths are intended to provide the local, fine-grained spatial information to the global information while upsampling, for precise localization. After the last block in the expansive path, the feature maps are first filtered using two 3 × 3 convolutional filters to produce two images: one for the ICH regions and one for the background. The final stage is a 1 × 1 convolutional filter with a sigmoid activation layer to produce the ICH probability at each pixel. In summary, the network has 24 convolutional layers, four max-pooling layers, four upsampling layers, and four concatenations. No dense layer is used in this architecture, in order to reduce the number of parameters and the computation time.
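The contracting/expansive structure described above can be sketched in Keras, the library used later in this work. This is an illustrative approximation rather than the authors' exact code: the filter counts, the "same" padding (which removes the need for cropping before concatenation), and collapsing the final two 3 × 3 filters plus 1 × 1 sigmoid into a single sigmoid head are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, n_filters):
    # Two 3x3 convolutions, each followed by a ReLU
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(160, 160, 1), base_filters=64, depth=4):
    inputs = layers.Input(shape=input_shape)
    skips, x = [], inputs
    # Contracting path: conv block then 2x2 max-pooling at each level
    for d in range(depth):
        x = conv_block(x, base_filters * 2 ** d)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, base_filters * 2 ** depth)  # bottleneck
    # Expansive path: 2x2 up-convolution, concatenate the skip, conv block
    for d in reversed(range(depth)):
        x = layers.Conv2DTranspose(base_filters * 2 ** d, 2,
                                   strides=2, padding="same")(x)
        x = layers.Concatenate()([skips[d], x])
        x = conv_block(x, base_filters * 2 ** d)
    # 1x1 convolution with sigmoid: per-pixel ICH probability
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)
```

With `depth=4`, this sketch reproduces the four poolings, four upsamplings, and four concatenations described above, and no dense layer is used anywhere.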

Experiments
No preprocessing was applied to the original CT slices, except removing 5 pixels from the image borders, which include only black regions. The resulting shape of the CT slices was 640 × 640. Three experiments were performed to validate the performance of U-Net and compare it with a simple threshold-based method. In the first experiment, a grid search was implemented to select the lower and upper thresholds of the ICH regions. The thresholds that resulted in the highest Jaccard index on the training data were selected and used in the testing procedure.
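The grid search of the first experiment might look like the following sketch. The candidate ranges and step sizes are our assumptions; the text specifies only that the threshold pair maximizing the training Jaccard index was kept.

```python
import numpy as np
from itertools import product

def jaccard(pred, truth):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union if union else 1.0

def fit_thresholds(slices, masks, lower_range=range(100, 211, 10),
                   upper_range=range(210, 256, 5)):
    """Pick the (lower, upper) intensity pair maximizing the mean training Jaccard."""
    best, best_score = None, -1.0
    for lo, hi in product(lower_range, upper_range):
        preds = [(s >= lo) & (s <= hi) for s in slices]
        score = np.mean([jaccard(p, m) for p, m in zip(preds, masks)])
        if score > best_score:
            best, best_score = (lo, hi), score
    return best, best_score
```

At test time, the selected pair is simply applied as `(slice >= lo) & (slice <= hi)` to produce a binary ICH mask.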
In the second experiment, the U-Net was trained and tested using the full 640 × 640 CT slices. However, we expected that this model would be biased toward the negative class because only a small number of pixels belong to the positive class in each CT scan. For the same reason, the authors in Ref. [23] used 160 × 160 crops instead of the entire CT slice and achieved a more precise model. This approach can also balance the training data by undersampling the negative crops. Therefore, in the third experiment, each slice from the CT scan was first divided using a 160 × 160 window with a stride of 80. This process resulted in 49 overlapping windows of size 160 × 160, which were then passed through the U-Net for the ICH segmentation. Later, the segmented windows of each CT scan were combined to produce the full 640 × 640 ICH masks. Finally, two consecutive morphological operations were performed on the ICH masks: closing, to fill in the gaps in the ICH regions, and opening, to remove outliers and non-ICH regions.
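The windowing and recombination pipeline can be sketched as follows. How overlapping predictions are merged is not specified in the text; averaging the overlaps is our assumption, as are the default 3 × 3 structuring elements for the morphological operations.

```python
import numpy as np
from scipy import ndimage

def extract_windows(ct_slice, size=160, stride=80):
    """Slide a size x size window with the given stride over a slice.
    For a 640 x 640 input this yields 7 x 7 = 49 overlapping windows."""
    h, w = ct_slice.shape
    return [ct_slice[r:r + size, c:c + size]
            for r in range(0, h - size + 1, stride)
            for c in range(0, w - size + 1, stride)]

def recombine(prob_windows, shape=(640, 640), size=160, stride=80, thresh=0.5):
    """Average overlapping probability windows into a full mask, threshold it,
    then apply closing (fill gaps) and opening (remove outliers)."""
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    i = 0
    for r in range(0, shape[0] - size + 1, stride):
        for c in range(0, shape[1] - size + 1, stride):
            acc[r:r + size, c:c + size] += prob_windows[i]
            cnt[r:r + size, c:c + size] += 1
            i += 1
    mask = (acc / cnt) >= thresh
    mask = ndimage.binary_closing(mask)
    return ndimage.binary_opening(mask)
```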
For evaluation purposes, we used the slice-level Jaccard index (Eqn. 1) and Dice similarity coefficient (Eqn. 2) to quantify how well the model segmentation on each CT slice fits the ground-truth segmentation:

Jaccard = |R_ICH ∩ R̂_ICH| / |R_ICH ∪ R̂_ICH|    (1)

Dice = 2|R_ICH ∩ R̂_ICH| / (|R_ICH| + |R̂_ICH|)    (2)

where R_ICH and R̂_ICH are the ICH regions segmented by the radiologists and by the U-Net, respectively.
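For binary masks, the two metrics can be computed directly. This is a straightforward implementation of Eqns. 1 and 2; the convention of returning 1.0 when both masks are empty is our choice.

```python
import numpy as np

def jaccard_index(truth, pred):
    """Eqn. 1: |R ∩ R̂| / |R ∪ R̂| for two binary masks."""
    inter = np.logical_and(truth, pred).sum()
    union = np.logical_or(truth, pred).sum()
    return inter / union if union else 1.0

def dice_coefficient(truth, pred):
    """Eqn. 2: 2|R ∩ R̂| / (|R| + |R̂|) for two binary masks."""
    inter = np.logical_and(truth, pred).sum()
    total = truth.sum() + pred.sum()
    return 2.0 * inter / total if total else 1.0
```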

Results
A subject-based, 5-fold cross-validation was used to train, validate, and test the developed model in all the experiments. For the first experiment, a grid search was implemented to select a lower threshold in the 100 to 210 range and an upper threshold in the 210 to 255 range. The selected thresholds, 140 and 230, resulted in a testing Jaccard index of 0.08 and a Dice coefficient of 0.135.
For the second and third experiments, the U-Net architecture illustrated in Figure 3 was implemented in the Python environment using the Keras library with TensorFlow as the backend [30]. The shape of the input image was 640 × 640 in the second experiment and 160 × 160 in the third experiment. The 640 × 640 CT slices or the 160 × 160 windows and their corresponding segmentation masks were used to train the network in each experiment. In our dataset, 36 subjects out of 82 were diagnosed with an ICH, resulting in only 318 ICH slices out of 2,491 (i.e., less than 13% of the images). In order to address the class-imbalance issue, a random undersampling approach was applied to the training data to reduce the number of 640 × 640 CT slices or 160 × 160 windows that did not contain an ICH.
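The random undersampling step can be sketched as follows. The final positive-to-negative ratio is not stated in the text; keeping a 1:1 ratio is our assumption.

```python
import random
import numpy as np

def undersample_negatives(windows, masks, seed=0):
    """Keep every window whose mask contains ICH pixels and an equal number
    of randomly chosen negative (all-background) windows."""
    pos = [(w, m) for w, m in zip(windows, masks) if m.any()]
    neg = [(w, m) for w, m in zip(windows, masks) if not m.any()]
    rng = random.Random(seed)
    kept_neg = rng.sample(neg, min(len(pos), len(neg)))
    pairs = pos + kept_neg
    rng.shuffle(pairs)            # mix positives and negatives for training
    return pairs
```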
At every cross-validation iteration, one fold of the CT scans was left out as a held-out set for testing, one fold was used for validation, and three folds were used for training. U-Net was trained for 150 epochs on the 640 × 640 CT slices or 160 × 160 windows and their corresponding segmentation masks using a GeForce RTX 2080 GPU with 11 GB of memory. The training stage took approximately 5 hours in each cross-validation iteration. During training, at each iteration, random slices were selected from the training data, and data augmentation was performed randomly using the following linear transformations:

• Rotation with maximum 20 degrees
• Width and height shifts
• Zooming
• Shearing

The dataset has a wide range of ages, which implies a wide range of head shapes and sizes; thus, zooming and shearing were applied for the augmentation. Also, the head orientation could differ from subject to subject; hence, rotation as well as width and height shifts were applied to increase the model generalizability. These linear transformations yield valid CT slices as would be present in real CT data. It is worth mentioning that non-linear deformations may produce slices that would not be seen in real CT data; as a result, we only used linear transformations in our analysis. In addition, all the subjects entered the CT scanner with their heads facing the same direction, so horizontal flipping would lead to CT slices that would not be generated in the data acquisition process. That is why we did not use it as an augmentation method.
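These linear transformations can be configured with Keras's ImageDataGenerator. Only the 20-degree rotation limit is stated in the text, so the shift, shear, and zoom ranges below are illustrative assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Flipping is disabled because all heads face the same direction during
# acquisition; only linear transformations are used.
augmenter = ImageDataGenerator(
    rotation_range=20,        # rotation with maximum 20 degrees
    width_shift_range=0.1,    # horizontal shift (assumed range)
    height_shift_range=0.1,   # vertical shift (assumed range)
    shear_range=10,           # shear angle in degrees (assumed range)
    zoom_range=0.1,           # zoom in/out by up to 10% (assumed range)
    horizontal_flip=False,
    vertical_flip=False,
    fill_mode="constant",
    cval=0.0,                 # pad revealed borders with black
)
```

In practice, the same transformation must be applied to a slice and its mask, e.g. by creating two such generators and passing the same `seed` to their `flow()` calls.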
The Adam optimizer was used with a cross-entropy loss and a learning rate of 1e-5. A mini-batch size of 2 was used in the second experiment and 32 in the third experiment. The trained model was validated after each epoch. The best-trained model, i.e., the one with the highest validation Jaccard index, was saved and used for testing purposes. The training evaluation metric was the average cross-entropy loss.
For the second experiment, when the full CT slices were used, the U-Net failed to detect any ICH region and produced only black masks. The reason was that even though we used only the CT slices with an ICH in the training phase, these CT slices still had very few pixels belonging to the positive class. As a result, the training dataset was biased significantly toward the negative class. Windowing the CT slices in the third experiment mitigated this bias by undersampling the negative crops. The 5-fold cross-validation of the developed U-Net resulted in a better performance for the third experiment, as shown in Table 5. The testing Jaccard index was 0.21 and the Dice coefficient was 0.31. At the slice level, the sensitivity was 97.2% and the specificity was 50.4%. Increasing the threshold on the predicted probability masks yielded a better testing specificity at the expense of the testing sensitivity, as shown in Table 6. Figure 4 provides the segmentation result of the trained U-Net on some test 160 × 160 windows along with the radiologist delineation of the ICH. The boundary effect of each predicted 160 × 160 mask was minimal. The boundaries show low probabilities for the non-ICH regions instead of zero, and they were zeroed out after thresholding and performing the morphological operations. The final segmented ICH regions after combining the windows, thresholding, and performing the morphological operations for some CT slices are shown in Figure 5. As shown in this figure, the model matched the radiologist's ICH segmentation perfectly in the slices shown on the left side, but there are some false-positive ICH regions in the right-side slices. Note that the CT slice in Figure 5, bottom right panel, shows the ending of an EDH region where the model only segments part of it.
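The thresholding and morphological clean-up applied to the recombined probability masks can be sketched as below. The paper does not specify which morphological operation or structuring element was used, so the binary opening with a 3 × 3 element here is an assumption for illustration.

```python
import numpy as np
from scipy import ndimage

def postprocess(prob_mask, threshold=0.5):
    """Threshold a predicted probability mask and clean it with morphology.

    The binary opening with a 3x3 structuring element is an assumed choice;
    it removes small speckles such as the low-probability artefacts near
    the 160x160 window boundaries while preserving larger ICH regions.
    """
    binary = prob_mask >= threshold
    opened = ndimage.binary_opening(binary, structure=np.ones((3, 3)))
    return opened.astype(np.uint8)
```

Raising `threshold` trades sensitivity for specificity, which is the effect reported in Table 6.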
The results based on the ICH sub-type showed that the U-Net performed best, with a Dice coefficient of 0.52, for the ICH segmentation of the subjects who had an SDH. The average Dice scores for the ICH segmentation of the subjects who had an EDH, IVH, IPH, and SAH were 0.35, 0.3, 0.28, and 0.23, respectively. The minimum Dice coefficient and Jaccard index in Table 5 were zero, corresponding to two subjects for whom the U-Net failed to localize the ICH regions. One of the subjects had only a small IPH region in one CT slice, and the other had only a small IPH region in two CT slices. The results based on the subjects' age showed that the Dice coefficient for the subjects younger than 18 years was 0.321 and for the subjects older than 18 it was 0.309. This analysis confirms that there is no significant difference between the method's performance for the subjects younger and older than 18 years.
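For reference, the two overlap metrics reported throughout (Dice coefficient and Jaccard index) can be computed from binary masks as follows; treating two empty masks as perfect agreement is a convention assumed here, not stated in the paper.

```python
import numpy as np

def dice_jaccard(pred, truth):
    """Compute the Dice coefficient and Jaccard index of two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    if union == 0:  # both masks empty: define perfect agreement (assumption)
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred.sum() + truth.sum())
    jaccard = intersection / union
    return dice, jaccard
```

The two metrics are monotonically related (Dice = 2J / (1 + J)), which is why the reported Jaccard index of 0.21 corresponds to a Dice coefficient of about 0.35 at the pixel level, of the same order as the reported 0.31.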

Discussion
A protocol was designed to collect head CT scans from subjects who had a TBI to diagnose the presence of an ICH, segment the ICH regions, and detect the ICH sub-types. A total of 82 CT scans were collected, in 36 of which an ICH region was detected. Later, the dataset was used to train and evaluate a threshold-based method and a U-Net network based on 5-fold cross-validation. U-Net was trained on the full CT slices in one experiment and on 160 × 160 crops in another experiment. In the latter, each CT scan was divided into 160 × 160 overlapped windows, and an undersampling technique for the negative class (non-ICH regions) was performed to compensate for the data imbalance.
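Dividing a slice into overlapped windows can be sketched as below. The paper does not state the overlap amount, so the 50% stride (80 pixels) is an assumption; with it, a 640 × 640 slice yields a 7 × 7 grid of crops.

```python
import numpy as np

def extract_windows(ct_slice, size=160, stride=80):
    """Split a CT slice into overlapping square windows.

    With the assumed stride of 80 (50% overlap), a 640x640 slice produces
    49 windows. The top-left coordinates are returned so the predicted
    window masks can later be recombined into a full-slice mask.
    """
    h, w = ct_slice.shape
    windows, coords = [], []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            windows.append(ct_slice[y:y + size, x:x + size])
            coords.append((y, x))
    return np.stack(windows), coords
```

The returned coordinates make the recombination step (placing each predicted 160 × 160 mask back into the 640 × 640 frame) straightforward.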
The U-Net model based on 160 × 160 crops of the CT slices resulted in a Dice coefficient of 0.31 for the ICH segmentation and a high sensitivity for detecting the ICH regions, and can be considered a baseline for this dataset. This performance is comparable to the deep learning methods in the literature that were trained on small datasets [24,25]. Kuang and colleagues reported a Dice coefficient of 0.65 when a semi-automatic method based on U-Net and a contour evolution was used for the ICH segmentation. They reported a Dice coefficient of 0.35 when only U-Net was used [25]. The performance of the U-Net trained in our study is comparable to their results considering that we used a smaller dataset that had all the ICH sub-types and not only intracerebral hemorrhage. Also, [24] tested an autoencoder and the active contour Chan-Vese model on a dataset that did not contain any SDH cases and reported an average Jaccard index of 0.55. The autoencoder was trained on half of the dataset, and later the entire dataset was used for testing, which could boost the average Jaccard index. The other deep learning-based models in Refs. [15,17,19,23] were trained and tested on larger datasets and achieved higher performance for the ICH segmentation. Ref. [15] reported an average Dice coefficient of 0.85, Ref. [17] reported a 78% overlap between the attention maps of their CNN model and the gold-standard bleeding points, Ref. [23] reported 78% average precision, and Ref. [19] reported 80.19% precision and 82.15% recall. In addition to the deep learning methods, in the study of Ref. [11], DRLSE was used for the segmentation of EDH, IPH, and SDH, and Dice coefficients of 0.75, 0.62, and 0.37 were reported for each sub-type, respectively. Our method achieved a higher Dice coefficient of 0.52 in segmenting SDH.
Some traditional methods reported better Dice coefficients (0.87 [21], 0.89 [22], and 0.82 [26]) for the ICH segmentation when a small dataset was used.
Regarding the ICH detection, U-Net achieved a slice-level sensitivity of 97.2% and specificity of 50.4%, which is comparable to the results reported by Yuh and colleagues [8], when a 0.5 threshold was used. Increasing the threshold to 0.8 resulted in 73.7% sensitivity, 82.4% specificity, and 82.5% accuracy, which is comparable to some methods in the literature that were trained on large datasets [13,16]. In [16], an ensemble of four 3D CNN models was trained on 10k CT scans and yielded 71.5% sensitivity and 83.5% specificity. In [13], a deep model based on DenseNet and an RNN achieved 81% accuracy.
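The slice-level detection figures above follow from the segmentation masks by calling a slice positive whenever any pixel survives the probability threshold. A minimal sketch of this evaluation, with the function name and rule assumed for illustration:

```python
import numpy as np

def slice_level_detection(prob_masks, truth_masks, threshold=0.5):
    """Slice-level ICH detection from pixel-wise probability masks.

    A slice is predicted positive if any pixel meets the threshold
    (an assumed decision rule). Returns (sensitivity, specificity).
    """
    pred = np.array([(m >= threshold).any() for m in prob_masks])
    true = np.array([t.any() for t in truth_masks])
    tp = np.sum(pred & true)
    fn = np.sum(~pred & true)
    tn = np.sum(~pred & ~true)
    fp = np.sum(pred & ~true)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Raising the threshold can only reduce the set of predicted-positive slices, which is why specificity improves at the expense of sensitivity in Table 6.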
Our observation was that the main reason for the low Dice coefficient of the trained U-Net was the false-positive segmentation, as shown in Figure 5. The false-positive segmentation was more prevalent near the bones, where the intensity in the grayscale image is similar to the intensity of the ICH region. Another limitation is that the developed U-Net model failed to localize the ICH regions in the CT scans of two subjects who had a small IPH region. Hence, the current method as it stands can be used as an assistive technology for radiologists in the ICH segmentation, but is not yet at a precision that would allow it to be used as a standalone segmentation method. Future work could include collecting further data and enhancing the U-Net with a recurrent neural network, such as an LSTM, to consider the relationship between adjacent slices when segmenting the ICH regions. Besides, the performance could be improved by utilizing transfer learning to initialize the model weights before training the model on the small ICH dataset.

Conclusions
ICH is a critical medical lesion that requires immediate medical attention, or it may turn into a secondary brain insult that could lead to paralysis or even death. The contribution of this paper is two-fold. First, a new dataset with 82 CT scans was collected. The dataset is made publicly available online at PhysioNet to address the need for more publicly available benchmark datasets toward developing reliable techniques for automated ICH segmentation. Second, a deep learning method for the ICH segmentation was developed. The developed method was assessed on the collected data with 5-fold cross-validation. It resulted in a Dice coefficient of 0.31, which is comparable to the performance of deep learning methods reported in the literature using small datasets. Moreover, the paper provides a detailed review of the methods for the detection of ICH and its sub-types as well as the segmentation of the ICH. Developing an automated ICH screening tool could improve the diagnosis and management of ICH significantly when experts are not immediately available in the emergency rooms, especially in developing countries or remote areas.

Acknowledgments: Thanks to Mohammed Ali for the clinical support and to all the subjects who participated in the data collection.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: