Automatic Deep Feature Learning via Patch-Based Deep Belief Network for Vertebrae Segmentation in CT Images

: Precise automatic vertebra segmentation in computed tomography (CT) images is important for the quantitative analysis of vertebrae-related diseases but remains a challenging task due to high variation in spinal anatomy among patients. In this paper, we propose a deep learning approach for automatic CT vertebra segmentation named patch-based deep belief networks (PaDBNs). Our proposed PaDBN model automatically selects the features from image patches and then measures the differences between classes and investigates performance. The region of interest (ROI) is obtained from CT images. Unsupervised feature reduction contrastive divergence algorithm is applied for weight initialization, and the weights are optimized by layers in a supervised ﬁne-tuning procedure. The discriminative learning features obtained from the steps above are used as input of a classiﬁer to obtain the likelihood of the vertebrae. Experimental results demonstrate that the proposed PaDBN model can considerably reduce computational cost and produce an excellent performance in vertebra segmentation in terms of accuracy compared with state-of-the-art methods.


Introduction
An accurate segmentation of individual vertebra in computed tomography (CT) images is an essential step for the clinical diagnosis and treatment of pathologic conditions. It is used in various applications of image-based vertebrae assessment, biomechanical modeling, and surgery planning [1]. It plays a vital role in computer-aided diagnoses (CAD) as it used for different clinical tasks to diagnose various diseases like scoliosis, kyphosis, spondylolisthesis, degenerative disc disease, etc. However, the exact vertebral boundaries cannot be determined due to the articulation of the vertebrae with each other that may induce vertebral overlaps in the segmentation of adjacent vertebrae. Despite an increasing attention in vertebra segmentation recently, robust and accurate spine segmentation methodologies are still deficient. Previous works on vertebra segmentation have used a multistep approach incorporating localization of spine and vertebra detection, followed by segmentation [2,3]. Multiple techniques have been applied to vertebra segmentation, such as active shape model and snake-based methods [4,5], level sets, or graph-based techniques using normalized cuts [6,7]. Most of these models depend on prior knowledge in the form of a statistical shape model or spine atlases. The performance of such methods is high on healthy spines with a good score of DICE coefficient. However, some of these methods fail in accurate segmentation with patients of osteoporotic fractures because such patients often suffer vertebral fractures in different stages and spinal deformities, such as the weights are undirected between visible and hidden layer. The two-step DBN learning consists of unsupervised and supervised. Unsupervised feature learning is implemented by the CD algorithm which trains the stacked RBMs to obtain the initial weight, and supervised learning is implemented by the back-propagation algorithm.
Patch-based methods have obtained effective results for segmentation from medical images [33,34]. These methods label each voxel in the target images by image patch comparison, center on the voxel with patches from an atlas library, and allocate the most probable label in accordance with the closest and best match. Patch-based methods of label fusion have been presented and have given robust segmentation. These methods recently show improved results in different computer vision tasks, including super-resolution [35] and texture synthesis [36]. Image denoising [37] has given a promising performance and led to various patch-based segmentation tool development for the facilitation of medical imaging application methods [33,34,38,39]. In a previous study [40], the corner detection of cervical vertebrae using the patch-based methodology in X-ray images is proposed. This work is extended for cervical vertebra segmentation in X-ray images by using a deep fully convolutional neural network (CNN) and has achieved a DICE similarity coefficient of 84% with a shape error of 1.69 mm [41]. In this work, twelve patches around the vertebra are generated to provide the spatial context, and the vertebra corners are detected using a machine learning method. A liver segmentation from CT images is presented in [42] from deep stack autoencoder (DSAE). This deep learning-based method learns features from unlabeled data by using autoencoder and fine-tuned these features for segmentation the liver from other abdominal organs. An iterative convolutional neural network based automatic method for vertebrae identification and segmentation is proposed previously [43] to achieve a mean DICE coefficient of 0.948 and a mean surface distance of 0.29 mm. This method used the inherent order of vertebral column to make solve of detection problem and the interactive procedure is utilized for vertebrae identification and segmentation one by one in sequential order. In this method for enabling the contextual information, the vertebra in the low-resolution images is first roughly identified and localized, and afterward reanalyzed in the original high-resolution images to achieve a fine segmentation. A DBN based vertebrae segmentation is proposed in our preliminary work from CT images in [44].
Unlike feature representation by the convolutional neural network (CNN) that comprises convolutional and subsampling tasks in learning a set of locally connected nodes through the locally receptive field for features extraction, our proposed PaDBN model is a full connection method for high-level feature learning. Unsupervised feature learning is implemented by the CD algorithm which trains the stacked RBMs to obtain the initial weight and supervised learning is implemented by the back-propagation algorithm to fine-tune the whole network and can optimize the deep-learned features for certain tasks, such as segmentation. While CNN is a partially connected approach that stresses the importance of locality while PaDBN is a fully connected approach which gets learning a single global weight matrix for features representation. The size of overlapping patches was set to 32 × 32 pixels. Therefore, we choose PaDBN instead of CNN in our work.
The objective we purse is to achieve the segmentation of vertebrae from CT images. In the present study, we propose based on our preliminary work [44], a new hierarchical deep learning feature representation PaDBN method for vertebrae segmentation. We adapted square and rectangular ROI extraction from input CT image for overlapping patch generation and used balance classes in training set to get high classification accuracy that was not done in our preliminary work [44]. Note, no matter what the proportion of samples in testing set, smaller and compact deep belief network architecture is adopted to minimize resources with better accuracy than our previous architecture. Our preliminary work describes a proof-of-concept strategy for improving vertebrae segmentation and result suggests that a fast correction method improves the segmentation and has potential to be incorporated as an additional implement into clinical usage. These preliminary results further motivated us to perform a full evaluation of PaDBN model for vertebrae segmentation accuracy. Moreover, this work also reported known limitation of the previous version and proposed a variation of the method to circumvent them. Our proposed model is an unsupervised learning of input feature statistics from CT images. We transfer the obtained network weight matrix to back-propagation neural network with the same structure to start the supervised learning step. In supervised learning step, conjugate gradient is tested for back-propagation neural network learning. These learned features are further integrated into the patch matching framework to find patches for label propagation. Then, a deformable model is adopted for the vertebra segmentation by the vertebra likelihood mapping driven from patch matching. Finally, the postprocessing stage improves the segmentation by removing the overflow portion by different morphological operations. The output of the postprocessing step is the final vertebra segmentation. The proposed method is extensively evaluated on CT datasets from the 2014 MICCAI workshop on the spine (CSI 2014) [45]; these datasets are available on SpineWeb [46]. The experimental results show that the deep learning features of PaDBN achieve excellent segmentation accuracy. Compared with state-of-the-art vertebra segmentation methods, our algorithm obtains competitive segmentation accuracy.

Methodology
Similar to other deep learning-based frameworks, our PaDBN framework can be divided into two stages: training and testing. The PaDBN model for automatic vertebra segmentation further comprises the preprocessing phase, patch generation, PaDBN, and segmentation steps. An overview of our proposed method is shown in Figure 1. These learned features are further integrated into the patch matching framework to find patches for label propagation. Then, a deformable model is adopted for the vertebra segmentation by the vertebra likelihood mapping driven from patch matching. Finally, the postprocessing stage improves the segmentation by removing the overflow portion by different morphological operations. The output of the postprocessing step is the final vertebra segmentation. The proposed method is extensively evaluated on CT datasets from the 2014 MICCAI workshop on the spine (CSI 2014) [45]; these datasets are available on SpineWeb [46]. The experimental results show that the deep learning features of PaDBN achieve excellent segmentation accuracy. Compared with state-of-the-art vertebra segmentation methods, our algorithm obtains competitive segmentation accuracy.

Methodology
Similar to other deep learning-based frameworks, our PaDBN framework can be divided into two stages: training and testing. The PaDBN model for automatic vertebra segmentation further comprises the preprocessing phase, patch generation, PaDBN, and segmentation steps. An overview of our proposed method is shown in Figure 1.

Preprocessing
The CT dataset can be represented by f: Ω × {1… s} R 2 × N → R, where 's' is the number of slice images in a volume dataset, Ω is domain, and f: Ω → R is a bivariate function on a domain Ω R 2 . Prior to segmentation, the preprocessing phase is necessary for preparing the relevant information from the CT images of the vertebra. This procedure includes different steps for preparing the meaningful features to train the model. We reduce the image size to the region of interest (ROI) by cropping the empty borders. The original 512 × 512 pixel images shown in Figure 2 are cropped to bounding box around the vertebra in different pixel image sizes, and the datasets are shown in Figure   Figure 1. The proposed patch-based deep belief networks (PaDBN) model for vertebrae segmentation.

Preprocessing
The CT dataset can be represented by f: Ω × {1 . . . s} ⊂ R 2 × N → R, where 's' is the number of slice images in a volume dataset, Ω is domain, and f : Ω → R is a bivariate function on a domain Ω ⊂ R 2 . Prior to segmentation, the preprocessing phase is necessary for preparing the relevant information from The bounding box surrounding the vertebrae decreases the patch numbers for fast and efficient computation. We focus on segmenting the vertebrae and image patch containing the vertebra region. The bounding box size is sufficiently large to adjust the vertebra spatial variation across the images of the CT dataset. The overlapping patches are created from the ROI around the vertebrae. Two types of ROI geometry are considered square and rectangular ROIs. We applied the Gaussian filter to the CT images with a fixed kernel size to control the smoothness of images for achieving segmentation accuracy and attenuating the effect of noisy pixels. The large value of the Gaussian kernel is correlated with the large deformation, and the estimated transformation is used as the input of a small kernel of smoothness. As a preprocessing step, a rough threshold is applied on the entire CT volume to remove noise artifacts. The purpose of the threshold window is to distinguish the vertebrae from other soft tissues [47]. The reason for applying threshold is that vertebrae have higher intensities than other tissues in CT images but are also similar to other bony structure, such as ribs. The network learns the difference between vertebra positive samples from other bony structure as negative samples.

Patch Generation
We divide the cropped CT images into m × m overlapping patches in accordance with the label on each pixel that is connected to a long-dimensional vector sequence and used as input for the network. The entire patches are created as binary class datasets, namely, vertebrae (positive sample) and nonvertebrae (negative samples), as shown in Figure 4. The overlapping patch of CT images is  The bounding box surrounding the vertebrae decreases the patch numbers for fast and efficient computation. We focus on segmenting the vertebrae and image patch containing the vertebra region. The bounding box size is sufficiently large to adjust the vertebra spatial variation across the images of the CT dataset. The overlapping patches are created from the ROI around the vertebrae. Two types of ROI geometry are considered square and rectangular ROIs. We applied the Gaussian filter to the CT images with a fixed kernel size to control the smoothness of images for achieving segmentation accuracy and attenuating the effect of noisy pixels. The large value of the Gaussian kernel is correlated with the large deformation, and the estimated transformation is used as the input of a small kernel of smoothness. As a preprocessing step, a rough threshold is applied on the entire CT volume to remove noise artifacts. The purpose of the threshold window is to distinguish the vertebrae from other soft tissues [47]. The reason for applying threshold is that vertebrae have higher intensities than other tissues in CT images but are also similar to other bony structure, such as ribs. The network learns the difference between vertebra positive samples from other bony structure as negative samples.

Patch Generation
We divide the cropped CT images into m × m overlapping patches in accordance with the label on each pixel that is connected to a long-dimensional vector sequence and used as input for the network. The entire patches are created as binary class datasets, namely, vertebrae (positive sample) and nonvertebrae (negative samples), as shown in Figure 4. The overlapping patch of CT images is shuffled and then packed into the same size batches to the feed input of the PaDBN model. The bounding box surrounding the vertebrae decreases the patch numbers for fast and efficient computation. We focus on segmenting the vertebrae and image patch containing the vertebra region. The bounding box size is sufficiently large to adjust the vertebra spatial variation across the images of the CT dataset. The overlapping patches are created from the ROI around the vertebrae. Two types of ROI geometry are considered square and rectangular ROIs. We applied the Gaussian filter to the CT images with a fixed kernel size to control the smoothness of images for achieving segmentation accuracy and attenuating the effect of noisy pixels. The large value of the Gaussian kernel is correlated with the large deformation, and the estimated transformation is used as the input of a small kernel of smoothness. As a preprocessing step, a rough threshold is applied on the entire CT volume to remove noise artifacts. The purpose of the threshold window is to distinguish the vertebrae from other soft tissues [47]. The reason for applying threshold is that vertebrae have higher intensities than other tissues in CT images but are also similar to other bony structure, such as ribs. The network learns the difference between vertebra positive samples from other bony structure as negative samples.

Patch Generation
We divide the cropped CT images into m × m overlapping patches in accordance with the label on each pixel that is connected to a long-dimensional vector sequence and used as input for the network. The entire patches are created as binary class datasets, namely, vertebrae (positive sample) and nonvertebrae (negative samples), as shown in Figure 4. The overlapping patch of CT images is shuffled and then packed into the same size batches to the feed input of the PaDBN model. To achieve optimal training, we use balanced training classes with equal number of samples (vertebra and background). This approach has a potential sampling bias in removing and selecting regions of sampling cases from the class with a large number of samples. The number of positive samples is equivalent to that of negative samples as balance training class that proves beneficial for model training. These steps reduce the computational time, generalization, and stability of the trained model. The purpose is to discriminate the vertebra from other objects in the CT image. Thus, the problem is two-class classification. Notably, no balanced class is done in the testing stage. For verification on balanced and unbalanced training classes, we conduct experiments to compare the accuracy on both datasets but achieve better results on balanced training class.

A. Patch to Feature Vector Conversion
The image informative representation is a feature that facilitates robust image segmentation. Deep learning, a subfield of machine learning, is used in the identification of meaningful features of the image. DBNs recently show impressive performance in various image analysis tasks [48][49][50]. In this study, DBN is used to convert the overlapped patches into feature vectors. To train and test our model, the image patches are converted into feature vectors. The proposed PaDBN model is used to extract the abstract features from CT images. The DBNs [31] have learning ability to extract a deep hierarchical representation of data images. Here, DBN is used as feature extractor because it can generate robust features that lead to the improvement in classification performance [48][49][50][51]. Preprocessing, i.e., cropping, normalization, smoothing, and balancing, is used to achieve the best representative features of raw images. This procedure is followed by utilizing the unsupervised modeling for extraction of deep structural features. The two hidden layers are used to extract the features from the vertebrae. The input visible layer and first hidden layer construct RBM1, and the first and second hidden layers form the RBM2 structure. The previous layer in RBM is used as a visible layer for the subsequent RBM. Considering CPU limitations and computational cost, the CT images after preprocessing are divided into 32 × 32 overlapped patches to prevent any distortions. Thus, each individual patch is presented by a row vector with 1024 dimensions. A total of 1024 dimension features from each individual patch are used in training the model. The model starts with 1024 units in the visible layer that shows high efficiency in feature selection and then finds the suitable number for these hyperparameters.

B. Training scheme
A model is performed on the learned features x to generate the high-level features of patches p for characterizing the PaDBNs of vertebra CT images. The output of the former layer in DBN is used as an input for the subsequent layer after several RBMs are stacked in a hierarchical manner.
Particularly, DBN lth layer can be shown as To achieve optimal training, we use balanced training classes with equal number of samples (vertebra and background). This approach has a potential sampling bias in removing and selecting regions of sampling cases from the class with a large number of samples. The number of positive samples is equivalent to that of negative samples as balance training class that proves beneficial for model training. These steps reduce the computational time, generalization, and stability of the trained model. The purpose is to discriminate the vertebra from other objects in the CT image. Thus, the problem is two-class classification. Notably, no balanced class is done in the testing stage. For verification on balanced and unbalanced training classes, we conduct experiments to compare the accuracy on both datasets but achieve better results on balanced training class.

Patch to Feature Vector Conversion
The image informative representation is a feature that facilitates robust image segmentation. Deep learning, a subfield of machine learning, is used in the identification of meaningful features of the image. DBNs recently show impressive performance in various image analysis tasks [48][49][50]. In this study, DBN is used to convert the overlapped patches into feature vectors. To train and test our model, the image patches are converted into feature vectors. The proposed PaDBN model is used to extract the abstract features from CT images. The DBNs [31] have learning ability to extract a deep hierarchical representation of data images. Here, DBN is used as feature extractor because it can generate robust features that lead to the improvement in classification performance [48][49][50][51]. Preprocessing, i.e., cropping, normalization, smoothing, and balancing, is used to achieve the best representative features of raw images. This procedure is followed by utilizing the unsupervised modeling for extraction of deep structural features. The two hidden layers are used to extract the features from the vertebrae. The input visible layer and first hidden layer construct RBM1, and the first and second hidden layers form the RBM2 structure. The previous layer in RBM is used as a visible layer for the subsequent RBM. Considering CPU limitations and computational cost, the CT images after preprocessing are divided into 32 × 32 overlapped patches to prevent any distortions. Thus, each individual patch is presented by a row vector with 1024 dimensions. A total of 1024 dimension features from each individual patch are used in training the model. The model starts with 1024 units in the visible layer that shows high efficiency in feature selection and then finds the suitable number for these hyperparameters.

Training Scheme
A model is performed on the learned features x to generate the high-level features of patches p for characterizing the PaDBNs of vertebra CT images. The output of the former layer in DBN is used as an input for the subsequent layer after several RBMs are stacked in a hierarchical manner.
Particularly, DBN lth layer can be shown as where W l weight matrix shows the association between input and output nodes and b l and c l are the bias vector correspondence of lth RBMs input and output layer, respectively. The given input , the M l output can denoted as where s M l , h l−1 shows the RBMs sampling procedure M l with input nodes h l−1 l . For training of a RBMs, the joint probability of lth layer can be represented as The energy function can be modeled as where w l ij belongs to real-valued weight linked between the nodes h l−1 i and h l j . There are many ways to learn an RBM. Finally, the PaDBN is constructed by stacking of all learned RBMs using divergence procedure to learn the parameters of each RBMs [52]. The overlapping patches (m × m) are fed as the input to RBMs in a hierarchical manner. The two nodes output layer of PaDBN can be greedily trained as shown in Figure 5.

Regression Model and Classification
Two methods can be used to convert the DBN into the regression model. In the first method, the DBN is trained in an unsupervised manner and then converted into the neural network (NN) without adding any other layer. In this way, all the samples are fed forward through this NN, and the output is treated as samples for the regression model. In the second method, DBN is converted into the NN after its supervised training by adding an output layer (with two neurons) for regression on the top. Thereafter, NN is trained with the labeled dataset for fine tuning and treated as a regression model. In our work, we use the second method for the regression model. For classification, we used the sigmoid function for two-class logistic regression. Sigmoid function utilization is suitable for our binary classification problem because it maps the predicted values to probabilities. The two nodes in the output layer enable the network to select the most relevant features for representation of each image patch. Thus, the deep learning features have two dimensions. The function can be written as follows where w belongs to real-valued weight linked between the nodes h and h . There are many ways to learn an RBM. Finally, the PaDBN is constructed by stacking of all learned RBMs using divergence procedure to learn the parameters of each RBMs [52]. The overlapping patches (m×m) are fed as the input to RBMs in a hierarchical manner. The two nodes output layer of PaDBN can be greedily trained as shown in Figure 5.

C. Regression Model and Classification
Two methods can be used to convert the DBN into the regression model. In the first method, the DBN is trained in an unsupervised manner and then converted into the neural network (NN) without adding any other layer. In this way, all the samples are fed forward through this NN, and the output is treated as samples for the regression model. In the second method, DBN is converted into the NN after its supervised training by adding an output layer (with two neurons) for regression on the top. Thereafter, NN is trained with the labeled dataset for fine tuning and treated as a regression model. In our work, we use the second method for the regression model. For classification, we used the sigmoid function for two-class logistic regression. Sigmoid function utilization is suitable for our

Segmentation
At the testing stage, image patches from unseen vertebra are fed into the model. The prediction stage first predicts the test image class patch along with the trained model. After placing a distribution at each of the feature vectors, the final probability map is generated by adding all distributions. The proposed network focuses on vertebra segmentation. The segmented image is obtained from the reconstruction of test patch probability map distribution. However, some flaws are observed in segmented images in boundary detection, and the results are poor. Some pixels of the background are segmented as vertebrae because of high contrast among vertebrae, ribs, and some other bony structure tissues. Moreover, some vertebra pixels are missing from foreground. To remove such flaws from the segmented images, we apply different morphological operations [11] as a postprocessing step to obtain the final segmentation.

Experimental Results
We evaluate our proposed model on CT datasets provided for CSI spine and vertebra segmentation at the 2nd MICCAI workshop and corresponding ground truth obtained from trauma center during the daily medical examination. The CT scan comprises all thoracic (T1-T12) and lumbar (L1-L5) vertebrae with 1 mm of slice thickness, and the data resolution ranges from 0.31 mm to 0.45 mm for in-plane resolution. These datasets are obtained from the patient's age group between 16 to 35 years old [45]. The proposed work is implemented in MATLAB 2017a as backend and trained on Dell desktop with a 3.6GHz Intel(R) i7 CPU, 32GB RAM, and Windows 7 ultimate operating system.
The training process evolves for binary segmentation of vertebra where all the vertebrae have the same labels, and the model distinguishes between vertebra and background in which features are learned in an unsupervised manner called pretraining. The activation conditional probability over hidden and visible units is given by the logistic function and is used as an activation function for the sigmoid function. Each RBM is trained by contrastive divergence (CD) algorithm [31]. The input image patch of visible layer x constructs RBM1 with the first hidden layer h1, and h1 is treated as a visible layer for constituting RBM2 with the second hidden layer h2. In our training procedure, we use two RBMs to train our model. The parameters of RBM1 (W, b, c) are obtained using CD algorithm. By fixing this layer parameter and training RBM2 from CD algorithm, the DBN global parameters are obtained as shown in Figure 6. A sparse penalty factor [31], which conducts the RBM sparsity by creating a connection of weight and hidden unit bias, is proposed on the basis of cross entropy. Cross entropy is used to distinguish between vertebra and nonvertebra using two small probability distributions to prevent the phenomenon of vertebra homogeneity during training. The 1024 raw dimension features are given as input to the network. The dataset is split into training 70% (training80%-validation20%) and testing 30%; seven CT images are used for training (training-validation) and three for testing.
Appl. Sci. 2018, 8, x FOR PEER REVIEW 10 of 17 Figure 6. The learning curve of the model. Figure 6. The learning curve of the model.
The training data samples comprise the matrix with 538,624 × 1024 dimensions and validation samples having a matrix with 179,200 × 1024 dimensions. Five-hundred-and-thirty-eight-thousandsix-hundred-and-twenty-four samples were randomly chosen for training, and 179,200 samples were randomly chosen for validation. To achieve efficient training, the mini batch size was set to 128. The training samples were divided into 4028 mini batches, and validation samples were divided into 1400 mini batches. The architecture of PaDBN was selected as 1024-100-100-2, where the number of input nodes is 1024. The number of epochs in unsupervised pretraining is 50 to fully train every RBM of PaDBN. The 3000 epochs are used in fine-tuning. The model is trained to minimize the mean square error (MSE) between the predicted map and the ground truth. The output layer is used as a sigmoid layer to show the probability map of vertebra or nonvertebra objects. Our trained model can assess the labels of the arbitrarily sized image. Given a test image, we extract the overlapping patches of size 32 × 32 and feed them to the trained model to generate prediction probability maps. For the overlapped image patch, the final probability maps will be the average probability map of overlapping patches that can be used to realize vertebra segmentation. Thus, the generalization and stability of the trained model can be guaranteed.

Parameter Settings
The parameters for the model are listed as the overlapped patch size 32 × 32 pixel. The PaDBN model has two hidden layers. The numbers of nodes in each layer are 100 and 100, respectively. Two nodes in the output layer enable the model to select the two most relevant features to represent each patch. Thus, the deep learning features have two dimensions. To select a deep network structure, we used a random search [53] to select the structure for best performance. First, we tried to identify the range of the hyperparameters and after that, we selected the values randomly. Then, we trained our model with these selected values and we repeated this procedure until getting the best performance. The L2 regularization was set to 0.00001. The learning rate for pretraining was 0.05 and that for fine-tuning was 0.008. The momentum for pretraining was 0.5 and that for fine-tuning was 0.9. Same mini batch size 128 is used to achieve efficiency during pretraining and fine-tuning. We used 50 and 3000 epochs in pretraining and fine-tuning, respectively. The parameter combination scheme that achieves the top performance of our proposed method is shown in Table 1. The learning curve of the PaDBN model is shown in Figure 6. A normalized training MSE of 0.026 and a normalized validation MSE of 0.032 were obtained after training. The learning curve shows high divergence before 500 epochs and becomes stable after 1700 epochs in the left side of Figure 6 (top left), but the (top right) graph shows the overfitting problem because the validation curve considerably deviates from the training curve after 500 epochs. The correct training curve, as shown in the (top left) graph, is obtained using the heuristic approach. Figure 6 (bottom) shows the poor training MSE graph with high learning rate (bottom left) and low learning rate (bottom right). Thus, weight initialization is important in deep learning. Figure 7 illustrates a typical feature representation of first and second hidden layers learned by two-RBM layer PaDBN model. Each square shows the weight between one hidden node and the pixel of the original image at the same position. The gray pixel indicates zero in the weight matrix, and white pixels denote positive values. The left side of Figure 7 shows the 100 node RBM1, amplified images show its weight visualization, and the right side of Figure 7 shows the weight visualization of 100 node RBM2.
The segmentation results are shown in Figures 8-10 on different test axial images.       The performance of vertebra segmentation is evaluated by specificity, sensitivity, precision, overall accuracy, F1 score, Matthews correlation coefficient (MCC) [54], DICE similarity index (DSI) [55], and Jaccard similarity index (JSI) [56] for quantitative assessment. We drew the confusion matrix to perform classic evaluation using four variables: true positive, which refers to the patches The performance of vertebra segmentation is evaluated by specificity, sensitivity, precision, overall accuracy, F1 score, Matthews correlation coefficient (MCC) [54], DICE similarity index (DSI) [55], and Jaccard similarity index (JSI) [56] for quantitative assessment. We drew the confusion matrix to perform classic evaluation using four variables: true positive, which refers to the patches segmented correctly as vertebrae in the ground truth; false positive, which refers to the patches that are not classified as vertebrae but classified as vertebrae by method; true negative, which refers to the patches that are correctly classified as background; and false negative, which refers to the incorrectly classified patches as background. The proposed method achieves excellent results in classical measurements with an accuracy of 93.3%, a sensitivity of 91.1%, and a specificity of 93.4%. These results also confirm the improvement in the MCC results and indicate that our segmentation results show very satisfactory agreement with the ground truth ( Table 2). DSI measures the spatial overlapping between two binary images, and its value ranges from zero (nonoverlap) and one (perfect overlapped). The DSI is calculated for the individual vertebra and then averaged for overall vertebrae. Each vertebra in the segmentation reference is compared with the labels with which it has the largest overlapped. All automatic detected vertebrae, including the misclassified ones, are considered for this evaluation. Our proposed method improves the segmentation results of DSI with 86.1% similarity to the ground truth and JSI of 84.9% score (Table 3). This research demonstrates that our proposed method can learn the complex segmentation task of vertebrae. Classification and segmentation errors are most often found in the first and last vertebrae, in which the model has less context availability. These vertebrae are often invisible, thereby making their segmentation difficult. DICE score is lower for the starting thoracic vertebrae because of ribs and intervertebral influence that are connected to the vertebrae. Some vertebrae that are not shown in the label annotations are still segmented by our method, thereby causing misclassification and low DICE score. These factors mislead segmentation. To show the generalization of our proposed method, we utilize the CSI 2014 dataset on the vertebra segmentation (Table 4). Table 4. Our method compared with state-of-the-art methods on vertebrae segmentation.

DSI (%)
Y. Li et al. [15] 81.0 S. F. Qadri et al. [44] 85.0 A. Seitel et al. [57] 83.0 A. Sekuboyina et al. [58] 87.0 S. M. M. R. Al Arif et al. [41] 84.0 Our method 86.1 Compared with the findings of other related methods, our obtained results are satisfactory. For example, a mean DICE score of 85% is obtained in our preliminary work [44] for thoracic and lumbar vertebrae segmentation from CT images by using deep belief networks modeling but that has complex deep network architecture. DICE coefficient of 83% is reported in [57] by evaluation on thoracic and lumbar vertebrae, and the method utilizes the statistical multiobject shape and pose for vertebra segmentation. However, this model lacks automatic initialization and possible segmentation overlap at the boundaries for the closely spaced thoracic vertebra. Meanwhile, an average DICE score of 81% is obtained for lumbar vertebrae CT image segmentation using level set formulation in the presence of different levels of salt and pepper noises [15]. A mean DICE coefficient of 87% is achieved by deep learning method for spine segmentation, and the method is evaluated on thoracic and lumbar CT vertebrae images [58]. However, this model fails in the structural consistency of the vertebrae parts segmentation or sometimes entire vertebra from starting and ending slices. Another deep learning method [41] is proposed for cervical vertebrae segmentation from X-ray images and achieved DICE similarity coefficient of 84%; however, the patch-based center localization method has the limitation of not knowing which center belongs to which vertebra. Our method is implemented in MATLAB and thus consumes time to segment the vertebrae. A possible solution to improve the efficiency of our method is to implement it in Keras with TensorFlow backend in Python or in Caffe. In this way, the computational time will be reduced.

Conclusions
In this paper, we propose a PaDBN learning model for vertebra segmentation in CT images on publicly available CSI 2014 dataset. The proposed model achieves promising results and is therefore applicable to the clinical setting. To address the challenges in ensuring robustness of feature representation and the large variations in the vertebrae, we propose to extract deep learning features by the PaDBN method. Then, the learned features are utilized under the patch matching model to estimate the vertebra likelihood map of the target image. Our proposed model based on deep learning shows superior performance and is more precise, robust, and generalized than traditional models that are dependent on predefined shapes and edges for fine-tuning. We believe that its performance can be further improved with a larger and more representative dataset than the current tested dataset. Compared with state-of-the-art methods, our framework demonstrates an excellent performance in vertebra segmentation. Our future research will extend the proposed method to different image modalities and spinal structures and will validate the method on large image databases with various spinal pathologies.