1. Introduction
Automatic vertebrae segmentation from medical images is critical for spinal disease diagnosis and treatment, e.g., assessment of spinal deformities, surgical planning, and postoperative evaluation. Computed tomography (CT) is one of the most commonly used imaging methods in clinical practice [1], and convolutional neural networks have become the leading choice for processing such images.
In practice, vertebrae segmentation from volumetric CT images suffers from the following challenges:
(1) Inter-class similarity: neighboring vertebrae show similar shapes and appearances in the sagittal view, making it difficult to distinguish, for example, the first lumbar vertebra (L1) from the second lumbar vertebra (L2) (as shown in Figure 1a);
(2) Unhealthy vertebrae, such as deformities or lesions (as shown in Figure 1b);
(3) Interference information: several soft tissues have gray-scale values similar to those of the vertebral area (as shown in Figure 1c,d).
In early work, vertebrae segmentation was approached predominantly as a model-fitting problem using statistical shape models (SSMs) [2,3] or active contour methods [4,5]. However, most traditional methods are only suitable for high-quality CT images in which the vertebrae are healthy, and they cannot perform well in the complex scenarios shown in Figure 1. Subsequently, machine learning came into wide use. Chu et al. [6] detected the centers of the vertebral bodies using random forests and Markov random fields and then used these centers to obtain fixed-size regions of interest (ROIs), in which the vertebrae were segmented by random forest voxel classification. Korez et al. [7] used a convolutional neural network to generate probability maps of the vertebrae, which guided a deformable model to segment each vertebra. Although machine learning methods outperform the traditional approaches in speed and efficiency, they offer no clear advantage in segmentation accuracy.
More recently, the deep learning model U-Net [8] has achieved great success in medical image segmentation and has become one of the most popular models in this field. Kim et al. [9] proposed a deep learning algorithm to detect ankle fractures from X-rays and CT scans and achieved quite good results. Their work demonstrated the power of deep learning in AI-assisted diagnosis and treatment. Moreover, Holbrook et al. [10] and Yogananda et al. [11] each proposed improved U-Net networks for tumor segmentation. To improve segmentation accuracy, many models based on a cascade structure have been proposed. Sekuboyina et al. [12] presented a two-stage lumbar vertebrae segmentation algorithm, which first extracted the ROI of the lumbar vertebrae using a multi-layer perceptron (MLP) and then performed instance segmentation within these regions. Similarly, Janssens et al. [13] segmented the five lumbar vertebrae using two cascaded 3D U-Nets. However, these methods are not designed for spinal CT images with a varying number of visible vertebrae and can only be used to segment the lumbar vertebrae. To handle an a priori unknown number of vertebrae, Lessmann et al. [14] proposed an iterative vertebrae instance segmentation method that slides a window over the image until a complete vertebra is visible, performs instance segmentation, and saves the result to a memory module; this process is repeated until all fully visible vertebrae are segmented. Unfortunately, this method requires a large amount of memory during training.
In this study, we propose a novel neural network for the vertebrae segmentation task. Firstly, the size and position of the vertebrae are not fixed in CT volumes. A 3 × 3 filter is too small to extract global information, while simply enlarging the kernel quickly increases the number of parameters. We therefore explore combinations of different receptive field sizes to improve the network's perception of the shape and size of vertebrae. Secondly, treating the representations at different spatial positions and channels of the feature map equally incurs a large computational cost and lowers segmentation accuracy. To enhance feature representation and suppress interference information, we introduce an attention mechanism that explores and emphasizes the interdependencies between channel maps. Finally, a coarse-to-fine strategy is employed in the prediction stage: each decoded feature produces a coarse segmentation, and these are concatenated to generate the final fine prediction.
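To make the receptive-field trade-off above concrete, the following standalone arithmetic sketch (illustrative only, not the paper's implementation) compares stacking small kernels with widening a single kernel:

```python
def receptive_field(kernel_sizes):
    """Effective receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def conv_weights(kernel_sizes, channels):
    """Weight count for a stack of square 2D convolutions that keep
    `channels` input and output channels at every layer (biases ignored)."""
    return sum(channels * channels * k * k for k in kernel_sizes)

# Two stacked 3x3 convolutions cover the same 5x5 area as one 5x5
# convolution, but with fewer weights (18*C^2 vs. 25*C^2).
print(receptive_field([3, 3]), receptive_field([5]))      # 5 5
print(conv_weights([3, 3], 64), conv_weights([5], 64))    # 73728 102400
```

This is why combining several small receptive fields can broaden the network's view of differently sized vertebrae more cheaply than one large kernel.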
4. Discussion
In this paper, we proposed a novel convolutional neural network to segment vertebrae from computed tomography images. Comparing the five methods with standard image assessment metrics demonstrated that our model generated superior segmentation predictions compared to the other four methods, with the highest segmentation accuracy and the most stable segmentation ability.
It is worth noting that a neural network works like a ‘black box’: all results reported in this paper were therefore collected over several repeated runs. The box plot in Figure 6 clearly shows that the superiority of our model is not occasional or accidental. The ablation study demonstrates that the multi-scale convolution and the attention mechanism both contribute greatly to the overall model: multi-scale convolution is responsible for improving the segmentation accuracy, while the attention mechanism plays an important role in preserving the boundary contour. Under deep supervision, their advantages are brought into full play. Their roles can likewise be verified from the box plot in Figure 9.
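To illustrate the channel attention discussed above, the following is a generic squeeze-and-excitation-style sketch in NumPy; the function name, weight shapes, and gating form are our assumptions for exposition, not the paper's exact module:

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """SE-style channel re-weighting (assumed form, for illustration).
    feat: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r) are the
    excitation weights with reduction ratio r."""
    s = feat.mean(axis=(1, 2))            # squeeze: global average per channel
    z = np.maximum(w1 @ s, 0.0)           # excitation: FC + ReLU
    g = 1.0 / (1.0 + np.exp(-(w2 @ z)))   # FC + sigmoid -> per-channel gates
    return feat * g[:, None, None]        # emphasize informative channels
```

Gating each channel map by a learned scalar in (0, 1) is what lets such a module suppress channels dominated by interference (e.g., soft tissue) while emphasizing vertebra-related ones.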
The main purpose of this paper is to present a deep-learning-based model, so we only compare it with four related networks. Unlike classification tasks, annotation for vertebrae segmentation is so difficult and professionally demanding that open-access, high-quality segmentation datasets are rare. To our knowledge, CSI 2014 was annotated by doctors and covers the entire thoracic and lumbar spine; we therefore chose it as the experimental dataset for verifying our model. Compared with the participating algorithms, our method generates its predictions through a single neural network, so its efficiency is significantly higher. Table 5 compares our model with the participating algorithms in terms of processing time per case.
In terms of expandability, our model is closer to a segmentation framework: it can easily be transformed into a 3D segmentation network, and both the multi-scale convolution and the attention mechanism are extensible. For example, self-attention [25,26] achieves a more accurate similarity measurement between two feature representations via dot-product calculation, and some studies [27,28] have obtained fairly good results on this basis. Although its computational complexity is very high, it is still worth exploring in future work. For multi-scale convolution [29,30], the key issue is the number of parameters. The methods in [31,32] used atrous convolution and pyramid pooling, respectively, to capture local and global information in the image; multi-scale feature extraction with fewer parameters could likewise be applied to our model.
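As a small illustration of "fewer parameters" here (our own arithmetic sketch, not taken from [31,32]), atrous (dilated) convolution enlarges the sampled area while the weight count stays fixed:

```python
def dilated_extent(k, dilation):
    """Spatial extent covered by a k x k kernel at the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

# A 3x3 kernel keeps its 9 weights per channel pair, yet the area it
# samples grows with the dilation rate: 3x3 -> 5x5 -> 9x9.
for d in (1, 2, 4):
    print(d, dilated_extent(3, d))
```

Stacking such branches at several dilation rates is the usual way to obtain multi-scale context without the parameter growth of genuinely larger kernels.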