Automatic Cephalometric Landmark Detection on X-ray Images Using a Deep-Learning Method

Abstract: Accurate automatic quantitative cephalometry is essential for orthodontics. However, manual labeling of cephalometric landmarks is tedious and subjective, and it must be performed by professional doctors. In recent years, deep learning has gained attention for its success in the computer vision field, achieving great progress on problems such as image classification and image segmentation. In this paper, we propose a two-step method that automatically detects cephalometric landmarks on skeletal X-ray images. First, we roughly extract a region of interest (ROI) patch for each landmark by registering the test image to the training images, which have annotated landmarks. Then, we use pre-trained networks with a ResNet50 backbone, a state-of-the-art convolutional neural network, to detect each landmark in its ROI patch. The network directly outputs the coordinates of the landmarks. We evaluate our method on two datasets: the ISBI 2015 Grand Challenge in Dental X-ray Image Analysis and our own dataset provided by Shandong University. The experiments demonstrate that the proposed method achieves satisfying results in both the successful detection rate (SDR) and the successful classification rate (SCR). However, the computational time remains to be improved in the future.


Introduction
Cephalometric analysis is performed on skeletal X-ray images. This is necessary for doctors to make orthodontic diagnoses [1][2][3]. In cephalometric analysis, the first step is to detect landmarks in X-ray images. Experienced doctors are needed to identify the locations of the landmarks. Measurements of the angles and distances between these landmarks greatly assist diagnosis and treatment plans. The work is time-consuming and tedious, and the problem of intra-observer variability arises since different doctors may differ considerably in their identification of landmarks [4].
Under these circumstances, computer-aided detection, which can automatically identify landmarks objectively, is highly desirable. Studies on anatomical landmark detection have a long history. In 2014 and 2015, the International Symposium on Biomedical Imaging (ISBI) launched challenges on cephalometric landmark detection [5,6], and several state-of-the-art methods were proposed. In the 2015 challenge, Lindner and Cootes's random-forest-based method achieved the best results, with a successful detection rate of 74.84% within a 2-mm precision range [7]. (Since an automatically detected landmark location can never exactly match the manual annotation, errors within a certain range are acceptable; the medically accepted range is usually 2 mm.) Ibragimov et al. achieved the second-best result in the 2015 challenge, with a detection accuracy of 71.7% within the 2-mm range [8]. They applied Haar-like feature extraction with random-forest regression and then refined the results with a global-context shape model. In 2017, Ibragimov et al. added a convolutional neural network for binary classification to their conventional method and slightly surpassed Lindner's result, with about 75.3% accuracy within the 2-mm range [9]. Also in 2017, Hansang Lee et al. proposed a deep-learning-based method that achieved reasonable results, although only on small, resized images; they trained two networks to directly regress each landmark's x and y coordinates [10]. In 2019, Jiahong Qian et al. proposed a new architecture called CephaNet, which improves on the Faster R-CNN architecture [11,12]; its accuracy is nearly 6% higher than that of other conventional methods.
Despite the variety of techniques available, automatic cephalometric landmark detection remains insufficiently accurate. In recent years, deep learning has gained attention for its success in the computer vision field. For example, convolutional neural network models are widely used for problems such as landmark detection [13,14], image classification [15][16][17] and image segmentation [18,19], and trained models surpass human performance in many applications. Direct regression of several landmarks at once is a highly non-linear mapping that is difficult to learn [20][21][22][23]. In our method, we therefore detect only one key point in each patch image: we learn a separate non-linear mapping function for each key point, which allows more accurate key-point detection than other methods.
In this paper, we propose a two-step method for the automatic detection of cephalometric landmarks. First, we get the coarse landmark location by registering the test image to a most similar image in the training set. Based on the registration result, we extract an ROI patch centered at the coarse landmark location. Then, by using our pre-trained network with the backbone of ResNet50, which is a state-of-the-art convolutional neural network, we detect the landmarks in the extracted ROI patches to make refinements. The experiments demonstrate that our two-step method can achieve satisfying results compared with other state-of-the-art methods.
The remainder of this paper is organized as follows: We describe the study materials in Section 2. In Section 3, we explain the proposed method in detail. Then, in Section 4, we describe the experiments and discuss the results. Lastly, we present the conclusion in Section 5.

Training Dataset
The training dataset is the 2015 ISBI Grand Challenge training dataset [6]. It contains 150 images, each of which is 1935 × 2400 pixels in TIFF format. Each pixel's size is 0.1 × 0.1 mm. The images were labeled by two experienced doctors. We use the average coordinate value of the two doctors as ground-truth for training.

Data Augmentation
The size of an X-ray image is 1935 × 2400 pixels. This is too large to input the whole image into a CNN and detect the locations of all 19 landmarks simultaneously. Therefore, we decided to detect landmarks in small patches. In each patch, only one landmark is detected, and we train one model per landmark. Overall, we have 19 models, since there are 19 landmarks to be detected. These models share the same architecture but have different weights, as presented in Figure 2.
For data augmentation, we randomly select 400 patches for each landmark in each training image. Each patch is 512 × 512 pixels, and the landmark can lie anywhere within it. The landmark's coordinate (x, y) within the ROI patch is used as the teaching signal (ground truth) for training. We then resize the patches to 256 × 256 pixels, as presented in Figure 3. This gives us 60,000 (400 × 150) images to train one model for landmark i (i = 1, 2, ..., 19). Figure 3 illustrates how we crop the ROI patches for each landmark: the blue dot is a target landmark, and the red boxes are the extracted 512 × 512 ROI patches. The landmark is included in all of the ROI patches. We randomly extract 400 patches per landmark from each X-ray image for training and then resize the ROI patches to 256 × 256.
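The patch-sampling scheme above can be sketched as follows. This is a minimal illustration, not the authors' exact code: the helper name `sample_training_patches` is our own, and a nearest-neighbour 2× downscale stands in for a proper resize (valid only for the 512 → 256 default).

```python
import numpy as np

def sample_training_patches(image, landmark, n_patches=400,
                            patch_size=512, out_size=256, rng=None):
    """Randomly crop patches that all contain the given landmark.

    `image` is a 2-D grayscale array, `landmark` its (x, y) pixel
    coordinate. Each 512x512 crop is chosen so the landmark falls
    somewhere inside it; the landmark's coordinate relative to the
    crop, rescaled to the 256x256 output, is the regression target.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape
    lx, ly = landmark
    scale = out_size / patch_size
    patches, targets = [], []
    for _ in range(n_patches):
        # Top-left corner chosen so the landmark stays inside the crop
        # and the crop stays inside the image.
        x0 = rng.integers(max(0, lx - patch_size + 1),
                          min(lx, w - patch_size) + 1)
        y0 = rng.integers(max(0, ly - patch_size + 1),
                          min(ly, h - patch_size) + 1)
        crop = image[y0:y0 + patch_size, x0:x0 + patch_size]
        # Nearest-neighbour 2x downscale stands in for a real resize.
        patches.append(crop[::2, ::2])
        targets.append(((lx - x0) * scale, (ly - y0) * scale))
    return np.stack(patches), np.array(targets)
```

With the defaults, one training image yields 400 (patch, target) pairs per landmark, matching the 400 × 150 = 60,000 samples per model described above.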

Test Dataset
We have two test datasets. Both datasets are tested through the models, which are trained with the ISBI training dataset.

International Symposium on Biomedical Imaging (ISBI) Test Dataset
The first test dataset, from the ISBI Grand Challenge, includes two parts: Test Dataset 1 with 150 images and Test Dataset 2 with 100 images. The ISBI test images were collected with the same machine as the training data, and each image is 1935 × 2400 pixels in TIFF format.

Our Dataset
Our own dataset is provided by Shandong University, China. In contrast to the ISBI dataset, the data were taken with a different machine. There are 100 images, each of which is 1804 × 2136 pixels in JPEG format.
Also, the landmarks labeled by the doctors are not exactly the same as those in the ISBI dataset (13 of them are the same). We used the labels for sella turcica, nasion, orbitale, porion, subspinale, supramentale, pogonion, menton, gnathion, gonion, lower incisal incision, upper incisal incision, and anterior nasal spine. The other landmarks were not labeled by the doctors.

Region Of Interest (ROI)
Extracted ROI patches are used as input images in our corresponding models. The method of extraction is described in detail in Section 3.

Overview
The method is a two-step method: ROI extraction and Landmark detection. For ROI extraction, we crop patches by registering the test image to training images, which have annotated landmarks. Then, we use pre-trained CNN models with the backbone of ResNet50, which is a state-of-the-art CNN, to detect the landmarks in the extracted ROI patches. The procedure is presented in Figure 4.

Automatic ROI Extraction
Because of the limitation on the input image size, we aim to extract small ROI patches automatically; that is, we extract an ROI patch that includes landmark i (i = 1, 2, 3, ..., 19). To extract the ROI automatically, we use registration to locate a coarse location of the landmark and then extract the ROI patch centered on that coarse location. Thus, the landmark location we aim to detect will be included in the ROI patch.
In this process, the test image is registered to the training images, which have annotated landmarks. We consider the test image as the moving image and the training images as reference images. The transform type we use is a translation transform, and the optimizer is gradient descent, which updates the position of the moving image. We apply an intensity-based registration method and calculate the mean squared error (MSE) between the reference image and the moving image.
After registration, we copy the landmark locations of the reference images (training images) to moving images (test images). We consider the landmark location of a reference image as the center of the ROI patch, extract a 512 × 512 resolution patch image on the moving image, and then resize it to a 256 × 256 resolution patch for the test. The extracted images will be treated as input to our trained convolutional neural network (CNN); thus, we can detect the corresponding landmark in the patch image. The registration results are presented in Figure 5.
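A minimal sketch of this patch-extraction and coordinate-mapping step follows; the function names are our own, and the boundary clamping near image edges is our assumption, since the paper does not specify how border cases are handled.

```python
import numpy as np

def extract_roi_patch(image, coarse_xy, patch_size=512):
    """Crop a patch_size x patch_size ROI centered on the coarse
    landmark location propagated from the registered reference image.
    The center is clamped so the crop never leaves the image; the
    returned offset lets a prediction made inside the patch be mapped
    back to full-image coordinates."""
    h, w = image.shape
    cx, cy = coarse_xy
    half = patch_size // 2
    x0 = int(np.clip(cx - half, 0, w - patch_size))
    y0 = int(np.clip(cy - half, 0, h - patch_size))
    patch = image[y0:y0 + patch_size, x0:x0 + patch_size]
    return patch, (x0, y0)

def to_image_coords(pred_xy, offset, patch_size=512, net_size=256):
    """Map an (x, y) prediction on the resized 256x256 patch back to
    coordinates in the original full-resolution image."""
    s = patch_size / net_size
    return (offset[0] + pred_xy[0] * s, offset[1] + pred_xy[1] * s)
```

The second helper is the inverse of the 512 → 256 resize: the network's prediction is scaled back up and shifted by the crop offset.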
Since people's heads vary in shape, randomly choosing one training image as the reference is not sufficient: the landmark to be detected may not be included in the ROI. To avoid this, we iterate over all the training images, performing registration 150 times for each test image (the total number of training images is 150). We then choose as the reference the training image that has the smallest squared error with respect to the test image, which finds the training sample closest to the test image. The computational time varies greatly between different reference and moving image pairs: the shortest registration takes only a few seconds, while the longest can take more than one minute. On average, registering one test image against all training samples takes around twenty minutes. The procedure is presented in Figure 6: we take the reference (training) image's landmarks as the centers of the patches and draw bounding boxes (ROIs) in the test image, which serve as inputs to the corresponding trained models. Red dots are the reference image's landmarks; green dots are the test image's landmarks, which are the landmarks we aim to detect.
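The reference-selection step can be sketched as follows. For simplicity, this sketch replaces the gradient-descent optimiser with an exhaustive search over integer translations, so it illustrates the idea rather than reproducing the authors' implementation; `np.roll` wraps around at the borders, which a real registration would not.

```python
import numpy as np

def best_translation_mse(reference, moving, max_shift=8, step=2):
    """Translation-only intensity-based registration by exhaustive
    search: try integer shifts of the moving image and keep the one
    with the lowest MSE against the reference."""
    best = (np.inf, (0, 0))
    for dy in range(-max_shift, max_shift + 1, step):
        for dx in range(-max_shift, max_shift + 1, step):
            # np.roll wraps around; acceptable for a small-shift sketch.
            shifted = np.roll(np.roll(moving, dy, axis=0), dx, axis=1)
            mse = np.mean((reference - shifted) ** 2)
            if mse < best[0]:
                best = (mse, (dx, dy))
    return best  # (mse, (dx, dy))

def choose_reference(training_images, test_image):
    """Register the test image to every training image and pick the
    training image with the smallest post-registration MSE."""
    errors = [best_translation_mse(ref, test_image)[0]
              for ref in training_images]
    return int(np.argmin(errors))
```

The chosen training image's annotated landmarks are then propagated to the test image as the coarse landmark locations.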

Network Architecture
We use the ResNet50 architecture [24] as our backbone to extract features from the ROI patches. A CNN can automatically extract useful features for different computer vision tasks. After feature extraction, we estimate the landmark location as a regression problem: we flatten all the features and add one fully connected layer, which directly outputs the coordinates of the landmark in the patch.
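A sketch of this architecture in Keras, the framework used in the experiments; `weights=None` is our simplification to avoid downloading weights, whereas the paper starts from a pre-trained network (for which `weights="imagenet"` would be the usual choice).

```python
import tensorflow as tf

def build_landmark_model(input_size=256):
    """ResNet50 backbone followed by a flatten and a single fully
    connected layer that regresses the (x, y) coordinate of one
    landmark inside the ROI patch."""
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights=None,
        input_shape=(input_size, input_size, 3))
    x = tf.keras.layers.Flatten()(backbone.output)
    coords = tf.keras.layers.Dense(2)(x)  # (x, y) in patch coordinates
    model = tf.keras.Model(backbone.input, coords)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model
```

One such model is trained per landmark, 19 in total, all sharing this architecture.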
The state-of-the-art ResNet architecture is widely used for its unique design. It consists of convolutional blocks and identity blocks. By adding earlier-layer information to later layers, ResNet effectively mitigates the vanishing-gradient problem. The overall architecture of ResNet50 is presented in Figure 7. In both the identity block and the convolutional block, the block's input is added to its output: the identity block adds the input directly, while the convolutional block first passes the input through a convolutional layer to make its size match the output.
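The two block types can be illustrated with simplified two-convolution versions (the real ResNet50 blocks use 1×1-3×3-1×1 bottlenecks with batch normalization; this sketch only shows how the shortcut addition differs between the two block types):

```python
import tensorflow as tf
from tensorflow.keras import layers

def identity_block(x, filters):
    """The shortcut adds the block input unchanged to its output,
    so input and output must have the same shape."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Activation("relu")(layers.Add()([shortcut, y]))

def conv_block(x, filters, stride=2):
    """The shortcut first passes through a 1x1 convolution so its
    shape matches the (possibly downsampled) main path before the
    addition."""
    shortcut = layers.Conv2D(filters, 1, strides=stride)(x)
    y = layers.Conv2D(filters, 3, strides=stride, padding="same",
                      activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Activation("relu")(layers.Add()([shortcut, y]))
```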

Cost function
The cost function we use during training is the mean squared error (MSE), since we want the predicted landmark to be as close as possible to the ground-truth landmark location. In this case, the MSE can be written as in Equation (1):

MSE = (1/N) Σ_{i=1}^{N} [(x_i − x̂_i)² + (y_i − ŷ_i)²]    (1)

where x_i and y_i denote the ground-truth coordinates, and x̂_i and ŷ_i denote the estimated coordinates.

Implementation
For the implementation of our experiments, we use a Titan X GPU (12 GB) to accelerate training, with the Python programming language and the deep learning tools TensorFlow and Keras. We use the Adam optimizer with a learning rate of 0.001. Ten percent of the training data (60,000 × 10% = 6000 images) is set aside for validation during training, and early stopping is applied to find the best weights.

Evaluation measures
Following the 2015 ISBI challenge on cephalometric landmark detection [6], we use the mean radial error (MRE) for evaluation. The radial error is defined in Equation (2):

R = √(Δx² + Δy²)    (2)

where Δx and Δy are the absolute differences between the estimated and ground-truth coordinates along the x-axis and y-axis.
The mean radial error (MRE) is defined in Equation (3):

MRE = (1/N) Σ_{i=1}^{N} R_i    (3)

Because the predicted results differ somewhat from the ground truth, we consider a prediction correct if the difference lies within a given range. In our experiments, we evaluate ranges of z mm (z = 2, 2.5, 3, 4); for example, if the radial error is 1.5 mm, we count it as a success at 2 mm. Equation (4) defines the successful detection rate (SDR):

SDR = (N_a / N) × 100%    (4)

where N_a indicates the number of accurate detections and N the total number of detections.
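The three evaluation measures in Equations (2)-(4) can be computed as in the following sketch; the function name and array layout are our own choices.

```python
import numpy as np

def evaluate(pred, gt, pixel_mm=0.1, thresholds=(2.0, 2.5, 3.0, 4.0)):
    """Radial error per landmark (Eq. 2), mean radial error (Eq. 3),
    and SDR at each z-mm threshold (Eq. 4). `pred` and `gt` are
    (N, 2) arrays of pixel coordinates; pixel_mm converts pixels to
    mm (0.1 mm/pixel for the ISBI images)."""
    radial = np.hypot(*(pred - gt).T) * pixel_mm  # R = sqrt(dx^2 + dy^2)
    mre = radial.mean()
    sdr = {z: 100.0 * np.mean(radial <= z) for z in thresholds}
    return mre, sdr
```

For example, pixel errors of (30, 40), (0, 10), and (12, 16) give radial errors of 5, 1, and 2 mm, so the MRE is 8/3 mm and the 2-mm SDR is 66.7%.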

ISBI Test Dataset
The results on the ISBI public data are presented in Tables 1 and 2: Table 1 presents the SDR, and Table 2 presents the MRE. We compare our detection results with those of other benchmarks in Table 3, which shows the SDR of the proposed method compared with other benchmarks on Testset 1 and Testset 2 of the ISBI 2015 challenge on cephalometric landmark detection.

Our Dataset
We use the models trained on the ISBI training dataset to test our own dataset directly, again registering all images first. Since some landmarks labeled in our own dataset differ from those in the ISBI dataset, we only test the landmarks shared with the ISBI dataset. The results are presented in Tables 4 and 5. Table 4 reports the 2 mm, 2.5 mm, 3 mm and 4 mm accuracies on our own dataset; since we do not have labels for landmarks 13, 14, 15, 16, 17 and 19, we test all landmarks except these.

Skeletal Malformations Classification
For the ISBI Grand Challenge, class labels of skeletal malformations based on ANB, SNB, SNA, ODI, APDI, FHI, FMA, and MW were also annotated for each subject; the corresponding classifications are defined in Table 6. The successful classification rates (SCRs) for these measures, computed from the detected landmarks, are presented in Table 7 (we ran the official Python program provided by the ISBI Grand Challenge [6] to obtain the SCR results automatically). For comparison, Table 7 also shows the SCRs of the state-of-the-art methods. As presented in Table 7, some SCRs of the proposed method are much higher than those of the existing methods (e.g., ANB, SNA, ODI, FHI). For the SNB measure, Lindner's result [7] is 0.1 percentage points higher than ours.
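As an illustration of how such angular measures are derived from detected landmarks, the sketch below computes ANB as SNA − SNB, both angles measured at the nasion. The landmark coordinates in the test are hypothetical, and the classification thresholds of Table 6 are deliberately not reproduced here.

```python
import numpy as np

def angle_at(vertex, p, q):
    """Angle in degrees at `vertex` between rays vertex->p and
    vertex->q."""
    u = np.asarray(p, float) - np.asarray(vertex, float)
    v = np.asarray(q, float) - np.asarray(vertex, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def anb(sella, nasion, a_point, b_point):
    """ANB = SNA - SNB; SNA and SNB are the angles at the nasion
    between the nasion->sella ray and the rays to the A and B
    points, respectively."""
    sna = angle_at(nasion, sella, a_point)
    snb = angle_at(nasion, sella, b_point)
    return sna - snb
```

The resulting angle is then compared against the class ranges in Table 6 to obtain the malformation label.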

Discussion
For registration, since people's heads vary in shape, there are still missed cases even though we select the training image closest to each test image as its reference. This means that, after registration, the patch created for testing does not include the ground-truth landmark. For the ISBI dataset there is only one missed patch, a rate of about 0.0002 (1/(19 × 250)), as presented in Figure 9; overall, this has little impact on the results. For Testset 2 of the ISBI Grand Challenge, Landmarks 3, 6, 13, and 16 have relatively low accuracy, although the method works well on Testset 1. After visualizing the test results, we found that the anatomy of the failed cases differs greatly from that of the successfully detected ones. Since our training dataset has only 150 images, even with data augmentation we cannot deny that people's skeletal anatomies are so varied that many types are not represented in the training dataset. When our trained models make predictions on cases with large anatomical differences, the results are unsatisfying; we believe this explains the poor performance on those landmarks in Testset 2. We show an example in Figure 10.
Initially, we tried inputting whole X-ray images to predict all 19 landmarks simultaneously, with conventional data augmentation of the training data such as rotation and cropping. However, the results were not satisfying: although the overall configuration of the predicted landmarks resembles the ground truth, the predicted locations are far from the ground-truth locations, as presented in Figure 11.
When we performed skeletal malformation classification based on the detected results, we found that the results vary considerably depending on which doctor's annotations are used as the ground truth. (The ISBI Grand Challenge 2015 provides two sets of annotations, made by a junior doctor and a senior doctor; because their landmark locations differ, the skeletal malformation classifications also differ. Table 8 presents the proportion of cases in which the two doctors assign the same classification label.) The results shown previously were based on the junior doctor's annotations; if we instead use the senior doctor's annotations as the ground truth for classification, the SCR is lower. This shows that the annotation data have a great influence on the results. During training, we use the average of the two doctors' annotated landmark locations as the ground truth; when we use only the senior doctor's annotations during training, the SDR also drops, which again shows that the annotation data greatly affect the final results. In Figure 9, we can see that the green dot (the ground-truth landmark) is not included in the cropped bounding box.

Conclusions
We proposed an approach that automatically predicts landmark locations using a deep learning method with very little training data, only 150 X-ray images. The results outperform other benchmarks in all SDRs and nearly all SCRs, which shows that the proposed method is effective for cephalometric landmark detection. The proposed method could be used for landmark detection in clinical practice under the supervision of doctors.
For future work, we are developing a GUI for doctors, offering either automatic or semi-automatic landmark detection. In the automatic mode, the system automatically extracts an ROI (based on registration) and selects the proper model for each landmark. In the semi-automatic mode, the doctor manually draws a bounding box to extract the ROI and selects the corresponding model for each landmark, which can reduce computational time. Since the simple registration used in this study takes a long time, around twenty minutes per test image as mentioned above, we aim to design a method that reduces the registration time in the testing phase and makes automatic detection more efficient. For example, we may use a deep learning method, either to perform the registration or simply to regress a coarse landmark region, replacing the rigid registration used in this paper. Moreover, because we detect all landmarks in patch images, we do not take global-context information (i.e., the relationships among the landmarks) into consideration. In future work, we want to use this global-context information to see whether we can achieve better performance; for example, we may change the network architecture so that the network can use the whole image's features to locate the coarse landmark positions better and faster.

Conflicts of Interest:
The authors declare no conflict of interest.