Development of a Convolutional Neural Network Based Skull Segmentation in MRI Using Standard Tessellation Language Models

Segmentation is crucial in medical image analysis, as it helps extract regions of interest (ROI) from different imaging modalities. The aim of this study is to develop and train a 3D convolutional neural network (CNN) for skull segmentation in magnetic resonance imaging (MRI). 58 gold standard volumetric labels were created from computed tomography (CT) scans as standard tessellation language (STL) models. These STL models were converted into matrices and overlapped on the 58 corresponding MR images to create the MRI gold standard labels. The CNN was trained with these 58 MR images and achieved a mean ± standard deviation (SD) Dice similarity coefficient (DSC) of 0.7300 ± 0.04. A further investigation was carried out in which the brain region was removed from the images with the help of a 3D CNN and manual corrections, using only MR images. This new dataset, without the brain, was presented to the previous CNN, which reached a new mean ± SD DSC of 0.7826 ± 0.03. This paper provides a framework for segmenting the skull using a CNN and STL models, as the 3D CNN was able to segment the skull with reasonable precision.


Introduction
Image segmentation is the process of partitioning an image into multiple sections to simplify it into something more meaningful, making it easier to locate regions of interest (ROI). In medical imaging and analysis, these ROI, identified by the segmentation process, can represent various structures in the body such as pathologies, tissues, bone, organs, prostheses, and so forth.
Magnetic resonance imaging (MRI) and computed tomography (CT) are the most common medical image scanning systems used to reveal relevant structures for automated processing of scanned data. Both techniques are excellent at providing non-invasive diagnostic images of organs and structures inside the body. However, CT is not favorable for routine anatomical imaging of the head, since it exposes the patient to small doses of ionizing radiation at each visit, putting the patient at risk of developing diseases such as cancer. For instance, a study in [1] pointed out that the risk of developing leukemia and brain tumors increases with the radiation exposure from CT scans. MRI scans, in contrast, have difficulty identifying different tissues because of the low signal-to-noise ratio of MRI. Additionally, due to bone's weak magnetic resonance signal, MRI scans struggle to differentiate bone tissue from other structures. Specifically, since different bone tissues tend to differ more in appearance from one another than from the adjacent muscle tissue, segmentation approaches must be robust to account for these variations in structure [2]. Thus, bone segmentation from MRI presents a challenging problem. Current biomedical image segmentation methods take advantage of deep learning with convolutional neural networks (CNNs) [3], as seen in [4], where a large deep CNN was trained to classify over 1 million high-resolution images with top-1 and top-5 test set error rates of 37.5% and 17.0%, much better than the previous state of the art.
In a CNN, neurons are arranged in successive layers that share weighted connections. During training, these layers extract features (such as horizontal or vertical edges) from the training images, which allows the CNN to perform tasks such as segmentation by recognizing these features in subsequent images. The advantage of CNNs over other techniques is that the convolutional image filters are learned and adapted automatically during the optimization process rather than hand-crafted.
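The feature-extraction idea can be illustrated with a hand-written convolution. This is a minimal NumPy sketch (our illustration, not the paper's code) in which a Sobel-style kernel responds to vertical edges; in a trained CNN, such kernels are learned from data rather than specified by hand.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation, as used inside CNN layers."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A Sobel-style kernel responds strongly to vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Toy image: dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 5))
img[:, 3:] = 1.0
response = conv2d(img, sobel_x)  # large values where the edge lies
```

The filter output is zero over uniform regions and peaks at the intensity transition, which is exactly the kind of localized feature map a convolutional layer produces.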
With recent advances in graphics processing units and improvements in computational efficiency, CNNs have achieved excellent results in biomedical image segmentation, where deep learning approaches can be performed in an efficient and intelligent way. CNNs have been extensively applied in medical image segmentation tasks such as brain and spine segmentation [5], acute brain hemorrhage [6], vessel segmentation [7], skull stripping in brain MRI [8], knee bone and cartilage segmentation [9], segmentation of craniomaxillofacial bony structures [10], proximal femur segmentation [11], and cardiac image segmentation [12].
A review of deep learning techniques has been performed by Garcia et al. [13], in which the authors highlight a promising deep learning framework for segmentation tasks known as UNet [14]. UNet is a CNN that uses an encoding down-sampling path and a decoding up-sampling path to increase the resolution of the output, and it has shown high performance when applied to biomedical images [7].
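The encoder-decoder structure can be traced with a small helper. This is a hypothetical illustration (not from the paper or [14]) that follows feature-map spatial sizes and channel counts through a UNet-style network, assuming the standard pattern of 2× pooling with channel doubling and skip-connection concatenation:

```python
def unet_shapes(size=256, base_channels=64, depth=3):
    """Trace (spatial size, channels) through a UNet-style encoder-decoder.

    Illustration only: a real UNet applies convolutions at every stage,
    and a conv after each concatenation reduces the channels again.
    """
    encoder = []
    s, c = size, base_channels
    for _ in range(depth):
        encoder.append((s, c))      # features saved for the skip connection
        s, c = s // 2, c * 2        # 2x2 max-pool halves size, channels double
    bottleneck = (s, c)
    decoder = []
    for skip_s, skip_c in reversed(encoder):
        s, c = s * 2, c // 2        # up-convolution doubles size, halves channels
        # concatenating the skip features doubles the channel count again
        decoder.append((s, c + skip_c))
    return encoder, bottleneck, decoder
```

For a 256 × 256 input with depth 3, the decoder recovers the full 256 resolution while carrying the high-channel context forward, which is the property the text attributes to UNet.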
Although numerous MRI segmentation techniques are described in the literature, few have focused on segmenting the skull in MRI. One approach to segmenting the skull in MRI is mathematical morphology [15]. The authors describe a method in which they first remove the brain using a surface extraction algorithm and mask the scalp using thresholding and mathematical morphology. During the skull segmentation process, the authors use mathematical morphology to omit background voxels with intensities similar to the skull. Using thresholding and morphological operations, the inner and outer skull boundaries are identified, and the results are masked with the scalp and brain volumes to establish a closed and non-intersecting skull boundary. Applying this segmentation method to 44 images, the authors achieved mean Dice coefficients of 0.7346, 0.6918, and 0.6337 for CT shifts of 1 mm, 2 mm, and 3 mm, respectively.
Wang et al. [16] utilized statistical shape information in 15 subjects, where the anatomy of interest is delineated in the CT data by constructing an active shape model of the skull surfaces. The automatic landmarking on the coupled surfaces is optimized by minimizing a description length that includes local thickness information. This method showed a Dice coefficient of 0.7500 for one segmented calvarium. A support vector machine (SVM) combining local and global features is used in [17]. A feature vector is constructed for each voxel in the image: the voxel itself is the first entry, and the remaining entries combine the intensities of a set of nearby voxels with statistical moments of the local surroundings. This feature vector is then presented to a trained SVM that classifies each voxel as either skull or non-skull. Using this SVM, the authors found a mean Dice coefficient of 0.7500 (minimum 0.68, maximum 0.81) for 10 patients from a dataset of 40 patients.
Most recently, [21] analyzed three methods of skull segmentation and identified multiple factors contributing to improved segmentation quality. Using a dataset obtained from 10 patients, they concluded that the best skull segmentations were achieved by FSL [22] and SPM12 [23], with mean Dice coefficients of 0.76 and 0.80, respectively.
These techniques are effective, with mean DSCs around 0.75 on small datasets; however, for larger datasets, or when extended to images collected from different MRI devices where the images suffer from noise and from variation in parameter values, this effectiveness may be compromised.
Arguably, one of the most important components in machine learning and deep learning is the ground truth labels. Careful collection of data and of the high-quality ground truth labels used to train and test a model is imperative for a successful deep learning project, but it comes at a computational cost and can be very time consuming [24]. Minnema et al. [25] demonstrated a high overlap with gold standard segmentation by introducing a CNN for skull segmentation in CT scans. In the image processing step, a 3D surface model representing the label was created in the standard tessellation language (STL) file format, a well-established method to represent 3D models [26][27][28][29]. To convert the files from CT to STL, segmentation of a high-quality gold standard STL model was performed manually by an experienced medical engineer. The results show only about a one-voxel difference between the gold standard segmentation and the CNN segmentation, with a mean Dice similarity coefficient of 0.9200. Therefore, this work aims to apply a deep learning approach, more precisely UNet, to skull segmentation, where the ground truth labels are created from CT imaging using the STL file format. Figure 1 presents a schematic overview of the proposed study. First, STL models are created from 58 CT scans. After being converted into matrices, these images are overlapped with the MR images to create the gold standard and the first dataset. Then, using dataset 1, the first CNN is created. To improve the accuracy, 62 MR images are used to generate brain STL models. These models are then converted into matrices to create a set of brain gold standard labels and a second dataset. A brain segmentation algorithm using a second CNN is created and, through this CNN model and manual corrections, the brain is removed from dataset 1. Finally, this new dataset is presented again to the first CNN topology.
(1) STL models are produced from 58 CT scans and then overlapped with the MR images to create the first dataset. (2) 62 MR images are used to create brain STL models, and a brain segmentation algorithm is created. The brain segmentation algorithm is combined with manual corrections to remove the brain from dataset 1 to create dataset 2.
(3) Finally, these two datasets are compared using the same CNN topology.

Dataset
We used The Cancer Imaging Archive (TCIA) data collections [30] to search for reliable datasets that contain CT and MRI from the same patient with minimal variation in the coronal, sagittal, and transverse planes. 58 volumetric CT and MR images were selected from four datasets to meet these criteria:

Data Processing I
As this study aims to use CT scans to create ground truth labels, the first step was to generate the STL models. To perform this task, the CT images were imported into Mimics Medical Imaging Software (Materialise, Leuven, Belgium). The 3D model mesh was built using individual global thresholding combined with manual corrections, which allowed the STL model to be constructed (Figure 2a-c). Then, to convert the geometric information (STL model) into a discrete voxel matrix, voxelisation was performed using [39] in MATLAB R2019B (Figure 2d).
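The voxelisation step can be illustrated with a crude surface rasterizer. This NumPy sketch is our own illustration, not the MATLAB voxelizer cited as [39] (which also fills the mesh interior): it simply marks the voxels hit by points sampled barycentrically on each STL triangle.

```python
import numpy as np

def voxelize_surface(triangles, grid_shape):
    """Crude surface voxelization of a triangle mesh (sketch only).

    Samples points on each triangle and marks the voxels they fall
    into; a production voxelizer would also fill the interior.
    """
    vol = np.zeros(grid_shape, dtype=bool)
    u = np.linspace(0.0, 1.0, 20)        # barycentric sample positions
    bound = np.array(grid_shape) - 1
    for tri in triangles:                # tri: three (x, y, z) vertices
        a, b, c = (np.asarray(v, dtype=float) for v in tri)
        for s in u:
            for t in u:
                if s + t <= 1.0:
                    p = a + s * (b - a) + t * (c - a)
                    idx = np.clip(p.astype(int), 0, bound)
                    vol[tuple(idx)] = True
    return vol
```

For example, a single triangle spanning one corner of a 5³ grid lights up only the voxels on its plane, giving the binary matrix form the labels take after conversion.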
To generate the MRI labels, the STL model extracted from the CT ground truth was overlapped onto each MRI slice in 3-modal MRI (T1, T2, and FLAIR) using a combination of manual translations, rotations, and scaling. These manual alignments were followed by visual inspection and fine adjustment to ensure good quality (Figure 3). T2-weighted scans were included because the border between the skull and cerebrospinal fluid (CSF) can be better delineated, as CSF appears bright in T2-weighted scans, and this has produced good results in [18,21]. Finally, all images were normalized to the range 0 to 1 and then resized to 256 × 256 using the nearest-neighbor interpolation method to improve the processing time.
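The two preprocessing operations can be sketched as follows. This is an illustrative NumPy version (assuming min-max scaling, which the stated 0-to-1 range implies, and plain index-mapping nearest-neighbor resizing; the paper's actual MATLAB routines may differ):

```python
import numpy as np

def normalize(img):
    """Min-max normalize intensities to the range [0, 1]."""
    img = img.astype(float)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize of a 2D slice (e.g. to 256 x 256)."""
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return img[rows[:, None], cols]
```

Nearest-neighbor interpolation is a natural choice for label volumes in particular, since it never invents intermediate intensity (or label) values.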

Data Processing II
To generate a second comparison, which could lead to an improvement in accuracy, the information in the dataset was reduced by removing regions of non-interest, namely the gray and white matter. To perform this task, 62 volumetric MR images were randomly chosen and, in a manner similar to that explained in Section 2.2, brain gold standard labels were created from the MR images. Creating brain labels is straightforward in MRI, since the brain is easy to identify in magnetic resonance images. The processing starts with thresholding, followed by region growing, and then creation and extraction of the STL models from the MRI (Figure 4).
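The threshold-plus-region-growing step can be sketched in a few lines. This is a generic 6-connected implementation (an illustration of the technique, not the Mimics algorithm): starting from a seed voxel, it collects all connected voxels whose intensity lies within the threshold window.

```python
import numpy as np
from collections import deque

def region_grow(volume, seed, low, high):
    """6-connected region growing from a seed voxel.

    Returns a boolean mask of voxels reachable from `seed` whose
    intensity lies in [low, high] -- e.g. a connected brain region.
    """
    mask = np.zeros(volume.shape, dtype=bool)
    in_range = (volume >= low) & (volume <= high)   # thresholding step
    if not in_range[seed]:
        return mask
    queue = deque([seed])
    mask[seed] = True
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
               (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    while queue:
        z, y, x = queue.popleft()
        for dz, dy, dx in offsets:
            n = (z + dz, y + dy, x + dx)
            if all(0 <= n[i] < volume.shape[i] for i in range(3)):
                if in_range[n] and not mask[n]:
                    mask[n] = True
                    queue.append(n)
    return mask
```

Unlike plain thresholding, the grown region excludes voxels that pass the intensity test but are disconnected from the seed, which is what isolates the brain from similarly bright, unconnected tissue.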

CNN Architecture and Implementation Details
The deep learning framework chosen in this paper was UNet, introduced by Ronneberger et al. [14]. This type of CNN was chosen because it works well with very few training images, yields precise segmentations, and has been used in a number of recent biomedical image segmentation applications [5,[9][10][11]40]. This network allows a large number of feature channels in the upsampling procedure, which contribute to the propagation of context information to the highest resolution layers. The result is a more symmetric expansive path and a U-shaped architecture.
In our implementation, we adopted a 3D UNet initially developed for brain tumor segmentation in MRI [41]. To avoid class imbalance when using conventional cross-entropy loss, a weighted multiclass Dice loss function was used as the general loss of the network [42]. Table 1 shows the implementation parameters chosen for the skull segmentation. The parameters were chosen to avoid computational error, improve the robustness and generalization ability of the CNN, and obtain good accuracy for the training set explored in this work. The CNN model was run on an Intel i7-9700 (3.00 GHz) workstation with 64 GB of RAM and two 8 GB VRAM NVIDIA graphics cards (RTX 2070 SUPER and RTX 2080). The code was implemented in MATLAB R2019B.
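A weighted multiclass Dice loss of the kind cited from [42] can be sketched as follows. This NumPy illustration assumes the generalized-Dice style of weighting (each class weighted by the inverse of its squared volume); the exact weighting in the paper's MATLAB implementation may differ.

```python
import numpy as np

def weighted_dice_loss(probs, targets, eps=1e-7):
    """Weighted multiclass Dice loss (sketch).

    probs, targets: arrays of shape (classes, voxels); targets one-hot.
    Each class is weighted by 1 / volume^2, so the small skull class is
    not swamped by the huge background class -- the class-imbalance
    problem that plain cross-entropy suffers from.
    """
    w = 1.0 / (targets.sum(axis=1) ** 2 + eps)       # per-class weights
    inter = (probs * targets).sum(axis=1)            # per-class overlap
    union = probs.sum(axis=1) + targets.sum(axis=1)  # per-class totals
    dice = 2.0 * (w * inter).sum() / ((w * union).sum() + eps)
    return 1.0 - dice
```

A perfect prediction drives the loss to zero, and errors on the rare class cost as much as errors on the common one, which is the point of the weighting.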
The Dice similarity coefficient (DSC) is a spatial overlap index that ranges from 0, indicating no spatial overlap between two sets of binary segmentation results, to 1, indicating complete overlap [48]. The symmetric volume difference (SVD) gives the symmetric difference between the shape and the segmentation as a Dice-based error metric. The Jaccard similarity coefficient (JSC) describes the intersection between the ground truth and the machine segmentation regions over their union; it ranges from 0% to 100% similarity. The volumetric overlap error (VOE) is the error measure corresponding to the JSC. Finally, to measure the segmentation accuracy in terms of the distance between the predicted segmentation boundary and the ground truth, Hausdorff distances (HD) using the Euclidean distance are used.
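For binary masks these metrics can be written down directly. The following NumPy sketch takes SVD = 1 - DSC and VOE = 1 - JSC (consistent with the descriptions above), and uses a brute-force Hausdorff distance suitable only for small examples:

```python
import numpy as np

def overlap_metrics(pred, gt):
    """DSC, JSC, VOE, and SVD between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dsc = 2.0 * inter / (pred.sum() + gt.sum())   # 2|A&B| / (|A|+|B|)
    jsc = inter / union                            # |A&B| / |A|B|union
    return {"DSC": dsc, "JSC": jsc, "VOE": 1.0 - jsc, "SVD": 1.0 - dsc}

def hausdorff(pred, gt):
    """Symmetric Hausdorff distance (Euclidean), brute force."""
    a = np.argwhere(pred)                          # foreground coordinates
    b = np.argwhere(gt)
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

Note that the overlap metrics reward agreement in volume, while the Hausdorff distance penalizes the single worst boundary error, so the two kinds of measure are complementary.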

Performance Analysis
The 58 volumetric CT and MR images were randomly divided into 49 for training and 9 for validation/testing. Table 2 presents the statistical analysis of the 9 test images after being trained and tested 10 times for 600 epochs. DSCs, SVDs, JSCs, VOEs, and HDs are calculated from the gold standard labels and the predicted labels. The DSC of the skull varies from 0.6847 to 0.8056, with a mean ± SD of 0.7300 ± 0.04, and from 0.9654 to 0.9833 for the background, with a mean ± SD of 0.9740 ± 0.007. To improve the results by reducing regions of non-interest, a 3D UNet task was performed using the same software and equipment as before. The difference between these two methods lies in the creation of the gold standard from different image modalities and in the 3D UNet parameters. This approach does not require the CT and MRI scans to overlap in order to create MRI gold standard labels; instead, it uses its own volumetric MRI to create the labels through the same processing presented previously.
Using the same datasets from The Cancer Imaging Archive data collections, 62 different volumetric MRI were used to create the brain dataset, of which 53 were used for training, 3 for validation, and 6 for testing. Table 3 shows the CNN implementation details for the brain segmentation, and the statistical analysis of the 6 tested images is presented in Table 4 after 10 rounds of testing. DSCs, SVDs, JSCs, VOEs, and HDs are calculated from the brain gold standard labels and the brain predicted labels.
These results show an accuracy rate of 0.9244 ± 0.04; however, the brain must be perfectly extracted. Therefore, after the CNN was tested on the 58 initial volumetric MRI, the labels generated in this process were manually corrected using the MATLAB program created by [49] in order to optimize brain removal. After removing the gray and white matter, the modified 58 volumetric images (49 for training and 9 for validation/testing) were presented to the CNN using the same parameters shown in Table 1, and the statistical analysis of the 9 tested images is shown in Tables 5 and 6. From Tables 5 and 6, it can be stated that reducing the information contained in the images, such as removing the brain, helps improve the segmentation of the skull bones. In fact, the DSC improvement varied from 2.31% to 7.27% for the skull and from 0.24% to 0.99% for the background, which demonstrates that the removal of information from the images directly affects the segmentation of the skull. Thus, the initial mean DSC for skull and background improved from 0.7300 ± 0.04 and 0.9740 ± 0.007 to 0.7826 ± 0.03 and 0.9804 ± 0.004, respectively, with a decrease in the standard deviation.
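The brain-removal step and the improvement figures above can be sketched as follows; `remove_brain` is a hypothetical helper (our illustration, not the program of [49]) that zeroes out the corrected brain label before retraining, and `relative_improvement` reproduces the kind of percentage change reported.

```python
import numpy as np

def remove_brain(mri, brain_label):
    """Zero out voxels inside the (corrected) brain label so the CNN
    trains on volumes without gray and white matter. Sketch only."""
    out = mri.copy()
    out[brain_label.astype(bool)] = 0.0
    return out

def relative_improvement(before, after):
    """Percentage improvement of a metric, e.g. the mean DSC."""
    return 100.0 * (after - before) / before
```

Applied to the reported means, the skull DSC improvement from 0.7300 to 0.7826 works out to roughly 7.2%, at the upper end of the 2.31% to 7.27% per-image range quoted above.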
The results represented by the DSCs in Table 2 of the present study are marginally lower than those reported in [15][16][17][18], which achieved mean DSCs of (0.7346, 0.6918, 0.6337), 0.75, 0.75, and (0.7344, 0.7446), respectively, and lower than those reported by [21] (0.76 and 0.80). However, the different results found in these studies must be interpreted with caution due to the differences between the databases used, the computational methods, and so forth.
This distinction may be attributed to the size of the dataset used. [40,50] reported mean Dice coefficients of 0.9189 and 0.9800 for CT skull segmentation using CNNs with datasets of 195 and 199 images, respectively, and [50] attributed this high DSC to the dataset size and image resolution compared to another study [25]. Thus, a change in the size of the dataset may have contributed to the improvement in the DSC values. In addition, the geometric disparity, variation, and deformity between skulls become more evident as the dataset increases. As the related works used small datasets, this aspect may have led to the high DSCs reported.
Furthermore, we used four distinct datasets [31,32,35,36] acquired with different CT and MRI devices and a variety of parameters. These datasets included variations in age, ethnicity, and medical history. In addition, several patients had undergone cephalic surgical treatment, which may have altered the skeletal structure of the skull. In total, 40 images have part of the skull removed due to brain abnormalities. These removals may have affected the performance of the segmentation.
From Table 6, the JSC and HD improved over their initial values (dataset 1), while the SVD and VOE decreased. These improvements and reductions can be attributed to the fact that there is less disagreement between the ground truth and the predicted segmentation in the brain region, since the brain is no longer present.
One drawback of the presented method arises during the creation of the gold standard STL model. An expert manually corrected the models by smoothing edges and removing noise residue, which may prompt the CNN to learn defects the expert may have introduced. Furthermore, during the conversion from STL model to label (voxelisation), some skull-voxel information may have been erroneously labeled as background when converted into imaging voxels. Another disadvantage concerns the number of training images. Unfortunately, usable data covering both CT and MR images of the same patient is difficult to acquire because of alignment issues, and is commonly limited by ethical and privacy considerations and regulations.
The results found in this article reflect a long-standing search for a bone segmentation technique in MRI; however, the proposed method's DSC (0.7826 ± 0.03) does not exceed the performance of current CT techniques: a DSC of 0.9189 ± 0.0162 [40], a DSC of 0.9200 ± 0.0400 [25], and a DSC of 0.9800 ± 0.013 [50].

Comparison between UNet, UNet++, and UNet3+
Further comparison is necessary to see how the DSC behaves across deep learning methods. UNet was compared to UNet++ [51], an encoder-decoder network in which a series of dense skip pathways connect the encoder and decoder sub-networks, and to UNet3+ [52], a deep learning approach that uses full-scale aggregated feature maps to learn hierarchical representations, incorporating low-level details with high-level semantics from feature maps at various scales. Table 7 compares the UNet, UNet++, and UNet3+ architectures in terms of segmentation accuracy measured by the Dice similarity coefficient on datasets 1 and 2. The parameters for each CNN were identical, with an encoder depth of 3, a mini-batch size of 16, and an initial learning rate of 0.005, and training was carried out for 100 epochs.
As seen, UNet3+ outperformed UNet and UNet++, obtaining average improvements over UNet of 1.51% and 0.26% on datasets 1 and 2, respectively. UNet took about an hour to train for the 100 epochs, while UNet3+ took about 2.5 times longer. Therefore, when the longer training time of UNet3+ is acceptable, this network may be used to slightly improve the segmentation results.

Conclusions
This study presented a 3D CNN developed for skull segmentation in MRI, where the training labels were acquired from the same patients' CT scans in standard tessellation language. The method initially demonstrated a DSC of 0.7300 ± 0.04 for the skull and 0.9740 ± 0.007 for the background; however, after removal of the gray and white matter, the DSCs reached averages of 0.7826 ± 0.03 and 0.9804 ± 0.004, respectively. Due to the limited number of datasets tested, further research may be undertaken to improve the mean DSC. In summary, the present method is a step forward in bone extraction from MRI using CNNs toward achieving average DSC rates similar to those obtained in CT scans.

Conflicts of Interest:
The authors declare no conflict of interest.
Disclosure Statement: This study does not contain any experiment with human participants performed by any of the authors.

Abbreviations
The following abbreviations are used in this manuscript: