Deep Convolutional Neural Networks Based Analysis of Cephalometric Radiographs for Differential Diagnosis of Orthognathic Surgery Indications

The aim of this study was to evaluate deep convolutional neural network (DCNN)-based analysis of cephalometric radiographs for the differential diagnosis of the indications of orthognathic surgery. Among the DCNNs, Modified-Alexnet, MobileNet, and Resnet50 were used, and the accuracy of the models was evaluated by performing 4-fold cross validation. Additionally, gradient-weighted class activation mapping (Grad-CAM) was used for visual interpretation to determine which regions affected the DCNNs’ classification. The prediction accuracy of the models was 96.4% for Modified-Alexnet, 95.4% for MobileNet, and 95.6% for Resnet50. According to the Grad-CAM analysis, the most influential regions for the DCNNs’ classification were the maxillary and mandibular teeth, mandible, and mandibular symphysis. This study suggests that DCNN-based analysis of cephalometric radiograph images can be successfully applied for differential diagnosis of the indications of orthognathic surgery.


Introduction
Diagnosing the need for orthognathic surgery is an important issue in the field of orthodontics and oral and maxillofacial surgery [1]. Orthognathic surgery is expensive and associated with the risks of general anesthesia. Therefore, patients tend to prefer orthodontic treatment to orthognathic surgery. However, there are cranial structural problems that cannot be solved by orthodontic treatment alone. A prominent chin, retrusive mandible, and jaw asymmetry can only be corrected by orthognathic surgery. Orthognathic surgery is also chosen when there is a limit to the esthetic improvement that can be achieved with orthodontic treatment alone [2,3].
The choice of orthognathic surgery depends on the evaluation of the objective data and not just the patient's decision. The determination of whether a patient's skeletal problem can be solved by an orthodontic approach or if it requires a surgical approach is a very sensitive issue [4]. Experienced orthodontists and oral surgeons can evaluate the need of a case, but this can be difficult if the clinician is less experienced or if it is a borderline case.

Materials and Methods
A total of 333 individuals who visited the Korea University Ansan Hospital for orthodontic evaluation between 2014 and 2019 were enrolled in this study. Cephalometric radiograph images were used for clinical examination, and 159 and 174 patients were indicated for orthognathic surgery and orthodontic treatment, respectively (Table 1). The datasets were randomly split. Forty cases were used solely to evaluate the model, while the training and validation sets were comprised of the remaining 293 cases. The training set consisted of 220 cases and the validation set consisted of 73 cases. This research protocol was approved by the Institutional Review Board, Korea University Ansan Hospital (no. 2020AS0062).
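The case counts above can be reproduced with a simple shuffled partition. The following is a minimal sketch, assuming a split by case index; the function name `split_cases` and the seed are hypothetical, and the source does not state whether the split was stratified by class.

```python
import random

def split_cases(n_cases=333, n_test=40, n_val=73, seed=0):
    """Shuffle case indices and partition them into train/validation/test."""
    indices = list(range(n_cases))
    # The seed is an assumption, used here only for reproducibility.
    random.Random(seed).shuffle(indices)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train, val, test = split_cases()
print(len(train), len(val), len(test))  # 220 73 40
```

With 293 non-test cases, one fold of a 4-fold split holds about 73 cases, which matches the reported validation size.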
In the data pre-processing step, 50 landmarks were detected with an automated landmark detection method based on the VGG16 model. All cephalometric radiographs were aligned to the Frankfort-horizontal plane, and the minimum box containing all 50 landmarks was selected (Figure 1). Within this box, a square region covering the lower part, which includes the main structures such as the maxilla and mandible, was selected. The images were cropped with a 10% margin and down-sampled to 256 × 256 pixels.
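One possible reading of this cropping step is sketched below. Several details are assumptions not given in the source: the square is taken to have a side equal to the shorter dimension of the landmark bounding box, it is anchored to the bottom edge and centered horizontally, and the 10% margin is applied on every side relative to that side length. The function name `crop_box` is hypothetical.

```python
def crop_box(landmarks, margin=0.10):
    """Return (left, top, right, bottom) of a square crop around the lower face.

    landmarks: iterable of (x, y) pixel coordinates, with y growing downward.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    # Assumed: square side equals the shorter side of the minimum bounding box.
    side = min(x_max - x_min, y_max - y_min)

    # Assumed: square is centered horizontally and sits on the bottom edge.
    center_x = (x_min + x_max) / 2
    left, right = center_x - side / 2, center_x + side / 2
    bottom, top = y_max, y_max - side

    # Assumed: margin is a fraction of the side, added on all four sides.
    pad = margin * side
    return (left - pad, top - pad, right + pad, bottom + pad)

# Illustrative landmarks only (corners of a 200 x 400 box), not study data.
box = crop_box([(100, 50), (300, 50), (100, 450), (300, 450)])
print(box)
```

The resulting region would then be resampled to 256 × 256 pixels with any standard image library.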
Appl. Sci. 2020, 10, x FOR PEER REVIEW
The three deep learning models used in this study were Modified-Alexnet, MobileNet, and Resnet50 (Figure 2). Modified-Alexnet combined the separated modules in the basic Alexnet for ease of implementation, and had an input image size of 227 × 227 pixels [25]. MobileNet and Resnet50 had a default size of 224 × 224 pixels for the input image [26].
Modified-Alexnet used He-normal initialization for its initial weights, whereas MobileNet and Resnet50 were initialized with weights pretrained on ImageNet. Training was conducted using the stochastic gradient descent (SGD) optimizer [1,27]. The initial learning rate was set at 0.002 and the batch size at 32. A total of 150 epochs were performed, and the learning rate was multiplied by a factor of 0.1 when the validation loss did not improve by more than 1e-6 over 20 epochs. When validation accuracy no longer increased significantly, training was terminated and the model was evaluated.
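The learning-rate rule described here can be illustrated with a small, framework-independent sketch. It assumes a reduce-on-plateau rule (similar in spirit to the `ReduceLROnPlateau` callback in Keras); the source does not name the framework or callback used, and the function name `lr_schedule` is hypothetical.

```python
def lr_schedule(val_losses, initial_lr=0.002, factor=0.1,
                patience=20, min_delta=1e-6):
    """Return the learning rate in effect at each epoch.

    The rate is multiplied by `factor` once the validation loss has failed
    to improve by more than `min_delta` for `patience` consecutive epochs.
    """
    lr = initial_lr
    best = float("inf")
    wait = 0
    rates = []
    for loss in val_losses:
        rates.append(lr)
        if loss < best - min_delta:
            best = loss      # meaningful improvement: reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor  # plateau detected: shrink the learning rate
                wait = 0
    return rates

# Illustrative loss curve: one improving epoch followed by a long plateau.
rates = lr_schedule([1.0] * 150)
print(rates[20], rates[21])
```

In this toy curve the rate stays at 0.002 through epoch index 20 and drops by a factor of 10 from epoch index 21 onward.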
The 256 × 256 pixel input images were randomly cropped to fit each model (227 × 227 pixels for Modified-Alexnet and 224 × 224 pixels for both MobileNet and Resnet50) and randomly flipped horizontally to reduce overfitting. Dropout and batch normalization were also applied. Equalization pre-processing was performed prior to input, and normalization was performed to set the maximum value of the image to 1. The learning sets were randomly divided into four sets for 4-fold cross validation, and each success rate was evaluated twice and statistically processed [28].
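The augmentation and scaling steps can be sketched with plain nested lists. This is an illustration only: the function names are hypothetical, the 50% flip probability is an assumption, and the equalization step is omitted because the source does not specify the method.

```python
import random

def random_crop_flip(image, crop_size, rng):
    """Take a random square crop and flip it horizontally half of the time."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_size)    # inclusive bounds
    left = rng.randint(0, width - crop_size)
    crop = [row[left:left + crop_size] for row in image[top:top + crop_size]]
    if rng.random() < 0.5:                      # assumed flip probability
        crop = [row[::-1] for row in crop]
    return crop

def scale_to_unit_max(image):
    """Divide every pixel by the image maximum so that the maximum becomes 1."""
    peak = max(max(row) for row in image)
    if peak == 0:
        return [[0.0 for _ in row] for row in image]
    return [[value / peak for value in row] for row in image]

# Synthetic 256 x 256 image used purely for demonstration.
rng = random.Random(0)
image = [[(r * 256 + c) % 255 for c in range(256)] for r in range(256)]
sample = scale_to_unit_max(random_crop_flip(image, 224, rng))
print(len(sample), len(sample[0]))  # 224 224
```

Passing 227 as `crop_size` gives the input size used for Modified-Alexnet.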
We also used the gradient-weighted class activation mapping (Grad-CAM) technique to visualize which regions the model attended to [29,30]. Color differences in the resulting map indicate which areas contributed most to the model's decision. Displaying this region of interest (ROI) makes the AI model more explainable.
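The core Grad-CAM computation can be sketched independently of any deep learning framework. In the sketch below, the feature maps of the last convolutional layer and the gradients of the class score with respect to them are assumed to be already available as nested lists; obtaining them requires a framework and is not shown. The function name `grad_cam` is hypothetical.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map.

    activations, gradients: lists of K feature maps, each H x W nested lists.
    Each channel weight is the spatial mean of its gradient map; the heat map
    is the ReLU of the weighted sum of the activation maps.
    """
    channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Global average pooling of the gradients gives one weight per channel.
    weights = [sum(sum(row) for row in grad) / (height * width)
               for grad in gradients]

    # Weighted sum over channels, then ReLU to keep positive evidence only.
    return [[max(0.0, sum(weights[k] * activations[k][i][j]
                          for k in range(channels)))
             for j in range(width)]
            for i in range(height)]

# Toy 2-channel, 2 x 2 example (illustrative values, not study data).
acts = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
grads = [[[1, 1], [1, 1]], [[-2, -2], [-2, -2]]]
heatmap = grad_cam(acts, grads)
print(heatmap)
```

In practice the heat map is upsampled to the input size and overlaid on the radiograph as a color map.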

Results
The average performance of Modified-Alexnet after two sets of 4-fold cross validation was 97.6% in the training set, 95.6% in the validation set, 91.9% in the test set, and 96.4% in total. For MobileNet, it was 98.5% in the training set, 92.7% in the validation set, 83.8% in the test set, and 95.4% in total, and for Resnet50 it was 98.1% in the training set, 94.5% in the validation set, 83.8% in the test set, and 95.6% in total (Figure 3). The standard deviations of the success rate for Modified-Alexnet, MobileNet, and Resnet50 were 1.5%, 1.7%, and 2.1% in the training set; 3.5%, 7.6%, and 6.1% in the validation set; and 1.1%, 2.2%, and 4.7% in the test set, respectively.
The screening performances of the three DCNN models tested in this study are displayed in Table 2. It can be observed that Modified-Alexnet achieved the highest performance, with an accuracy of 0.919 (95% CI 0.888 to 0.949), sensitivity of 0.852 (95% CI 0.811 to 0.893), and specificity of 0.973 (95% CI 0.956 to 0.991).
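For readers unfamiliar with these screening metrics, the sketch below shows how accuracy, sensitivity, and specificity with 95% confidence intervals can be derived from a confusion matrix. It assumes that orthognathic surgery is the positive class and uses a normal-approximation (Wald) interval; the source does not state which interval method was used, and the counts in the example are hypothetical rather than taken from Table 2.

```python
import math

def proportion_ci(successes, total, z=1.96):
    """Return (estimate, lower, upper) using a normal-approximation interval."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

def screening_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity, each with a 95% interval."""
    return {
        "accuracy": proportion_ci(tp + tn, tp + fn + tn + fp),
        "sensitivity": proportion_ci(tp, tp + fn),   # true positive rate
        "specificity": proportion_ci(tn, tn + fp),   # true negative rate
    }

# Hypothetical counts for a 40-case test set (not the study's results).
metrics = screening_metrics(tp=17, fn=3, tn=19, fp=1)
for name, (estimate, lower, upper) in metrics.items():
    print(f"{name}: {estimate:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

For small samples or proportions near 0 or 1, an alternative such as the Wilson interval is often preferred over the Wald interval.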
In cases of successful prediction, the ROI identified by class activation mapping (CAM) was mainly focused on the maxillary and mandibular teeth, mandibular symphysis, and mandible. In cases of prediction failure, the highlights were located elsewhere or the ROI was scattered throughout the image, indicating that the model was not properly focused (Figure 5).

Discussion
Most previous AI research on orthodontic diagnosis was conducted by identifying landmark points and calculating measurement values [23,24]. This approach is not only influenced by which measurement values are entered, but also has the disadvantage that many errors are introduced in the process of identifying the landmark points. Recently, automated landmark detection methods have been introduced, but their performance is comparable to or slightly lower than manual detection by specialists [31]. A further disadvantage is overfitting, which can easily occur when measurement values with similar meanings are input to the machine-learning model. Even if automated landmark detection is extremely precise, the neural network model remains limited because the choice of inputs to the artificial neural network (ANN) model is ambiguous. In this study, the precision of automated landmark detection was not important because the landmark search was used only to find the range for cropping the image.
A deep learning algorithm extracts features of an image using convolution filters and pooling layers and then analyzes patterns using those features (Figure 6). Figure 6 shows how the features were extracted after the original images in the top row passed the convolutional filters in the left column. Many deep learning models have been developed and improved by varying the size, type, location, and combination of filters, among other ideas.
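The two building blocks named above can be illustrated on a tiny image. This is a didactic sketch, not the implementation used in the study: the function names are hypothetical, the filter is hand-picked rather than learned, and no padding or stride options are modeled.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding.

    As in most deep learning libraries, the kernel is not flipped, so this
    is strictly a cross-correlation.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k_h) for b in range(k_w))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(feature, size=2):
    """Non-overlapping max pooling; leftover rows and columns are dropped."""
    rows, cols = len(feature) // size, len(feature[0]) // size
    return [[max(feature[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(cols)]
            for i in range(rows)]

# A 4 x 4 image with a vertical edge, and a filter that responds to it.
image = [[0, 0, 1, 1] for _ in range(4)]
edge_filter = [[-1, 1], [-1, 1]]
feature_map = conv2d_valid(image, edge_filter)
pooled = max_pool(feature_map)
print(feature_map)
print(pooled)
```

The feature map is non-zero only in the column where the edge lies, and pooling keeps that strongest response while shrinking the map.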
The Alexnet architecture was the winning model of the ILSVRC in 2012, and was a successful model that led to the development of deep learning [25]. It is a basic model of deep learning with a simple but powerful structure. Until recently, Resnet50 has been in the spotlight as a state-of-the-art model for image classification problems [17]. MobileNet is a model that can be trained with relatively simple calculations to make it easy to use with mobile phones, and its complexity is halfway between Alexnet and Resnet50 [32].
In this study, Resnet50 and MobileNet fit the training set very quickly, while the fit to the validation set followed slowly. Modified-Alexnet, in contrast, fit the training set and the validation set at similar rates (Figure 7). This also led to differences in success rates for the test set between the three models.
The ILSVRC is a competition in which an image set is classified into 1000 subclasses [17,25]. In this study, the cephalometric radiographs were classified into only two types, orthognathic surgery and orthodontic treatment, without the complexity of 1000 subclasses. Therefore, more complex and deeper models, such as Resnet50, tended to overfit this simpler problem. Modified-Alexnet, on the other hand, is not highly complex, but it includes several deep learning techniques (L2 regularization, dropout, etc.) that improve performance [25].
In the ILSVRC, models can learn from a large amount of data, but it is not easy to obtain a sufficiently large dataset when data collection is limited, as with medical data. In this study, up to 2048 images could be generated from one image by randomly cropping and flipping a base image of 256 × 256 pixels. This process helped to make the learning in this study meaningful, even with a small amount of data. This study provides hints on the strategies to use when there is a limited amount of data.
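The figure of 2048 is consistent with a simple count for the 224 × 224 crop: 32 horizontal offsets × 32 vertical offsets × 2 flip states. This decomposition is an inference, not something stated in the source, and the function name below is hypothetical. The sketch also shows how the count changes if both end offsets are included or if the 227 × 227 crop is used.

```python
def count_augmented_views(image_size, crop_size, flips=2, inclusive=False):
    """Count distinct crop-and-flip variants of a square image.

    inclusive=False counts (image_size - crop_size) offsets per axis, which
    matches the reported 2048; inclusive=True also counts the final offset.
    """
    offsets = image_size - crop_size + (1 if inclusive else 0)
    return offsets * offsets * flips

print(count_augmented_views(256, 224))                  # 2048
print(count_augmented_views(256, 224, inclusive=True))  # 2178
print(count_augmented_views(256, 227))                  # 1682
```

Under either counting convention, each radiograph yields on the order of a couple of thousand distinct training views.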
If the clinician performs pre-processing to obtain measured values by locating or tracing landmark points on the cephalometric radiographs, the precision of landmark detection and the resulting measurements may vary depending on the clinician's ability. However, diagnosing with the image itself reduces the chances of such an intervention. Moreover, information that has not been considered previously can be taken into account, which means that the benefits of deep learning can be obtained more reliably.
One of the biggest diagnostic differences between indications of orthognathic surgery and orthodontic treatment is determining whether the skeletal differences between the maxilla and mandible can be overcome [4]. In this study, Grad-CAM was used to determine whether the deep learning AI model considered and evaluated the correct region. Grad-CAM is a generalized version of CAM that can be applied to AI models without global average pooling [29,30]. The highlighted CAM gives insight into what the model is looking at and helps to evaluate why a prediction failed. The highlighted regions did not differ from those that a clinician would examine and evaluate in practice.
The limitation of this study was that comparison of the performance was limited due to the lack of data. When assessed with more data, different results can be obtained, which will need to be addressed in future studies. Additionally, future studies should analyze the impact of better model construction and their comparison with other models. For example, in the Resnet model, we can compare the accuracy difference according to the number of blocks. Compared to conventional machine learning, potential issues that deep learning may have include image quality problems, differences in X-ray devices, and image distortion caused by disturbing factors such as dental prostheses.
The significance of this study is that it is the first study to perform differential diagnosis of the indications of orthognathic surgery and orthodontic treatment based on images rather than measurements, and it showed a significant success rate. Compared with previous studies conducted with ANNs, the diagnostic performance for orthognathic surgery increased from 95.8% to 96.4% in total [33]. Most of the artificial intelligence research in the field of orthodontics is still limited to ANNs. AI models have been applied to determining whether extraction is needed, extraction patterns, and anchorage patterns [23,24]. The image feature analysis and classification technique using DCNNs has had an enormous impact on the entire medical field, and the same is true in the field of orthodontics and oral surgery.

Conclusions
In this study, the DCNN approaches to the cephalometric radiograph image-based differential diagnosis for indications of orthognathic surgery were successfully applied and showed a high success rate of 95.4%-96.4%. Visualization with Grad-CAM also showed the potential for descriptive AI and confirmed the ROI required for differential diagnosis. Differential diagnosis using the whole image rather than evaluation with a specific measurement is a characteristic part of diagnosis using deep learning, and it also has the advantage of taking into account subtleties that are not represented by measurements. Therefore, the results of this study suggest that DCNNs could play an important role in the differential diagnosis of the indications of orthognathic surgery. Further research will be needed with the application of various deep learning structures, an appropriate number of datasets, and sets of images with verified labels.