Effectiveness of Human–Artificial Intelligence Collaboration in Cephalometric Landmark Detection

Detection of cephalometric landmarks has contributed to the analysis of malocclusion during orthodontic diagnosis. Many recent studies involving deep learning have focused on head-to-head comparisons of accuracy in landmark identification between artificial intelligence (AI) and humans. However, a human–AI collaboration for the identification of cephalometric landmarks has not been evaluated. We selected 1193 cephalograms and used them to train the deep anatomical context feature learning (DACFL) model. The number of target landmarks was 41. To evaluate the effect of human–AI collaboration on landmark detection, 10 images were extracted randomly from 100 test images. The experiment included 20 dental students as beginners in landmark localization. The outcomes were determined by measuring the mean radial error (MRE), successful detection rate (SDR), and successful classification rate (SCR). On the dataset, the DACFL model exhibited an average MRE of 1.87 ± 2.04 mm and an average SDR of 73.17% within a 2 mm threshold. Compared with the beginner group, beginner–AI collaboration improved the SDR by 5.33% within a 2 mm threshold and also improved the SCR by 8.38%. Thus, the beginner–AI collaboration was effective in the detection of cephalometric landmarks. Further studies should be performed to demonstrate the benefits of an orthodontist–AI collaboration.


Introduction
In orthodontics, detection of cephalometric landmarks refers to the localization of anatomical landmarks of the skull and surrounding soft tissues on lateral cephalograms. Since the introduction of lateral cephalograms by Broadbent and Hofrath in 1931, this approach has contributed to the analysis of malocclusion and has become a standardized diagnostic method in orthodontic practice and research [1]. In the last decade, an advanced machine-learning method called "deep learning" has received attention, and several studies have sought to improve the accuracy of landmark identification on lateral cephalograms. Deep learning-based approaches using convolutional neural networks (CNNs) have achieved remarkable results [2][3][4][5]. These results suggest that deep learning using CNNs can assist dentists in reducing clinical problems related to orthodontic diagnosis, such as tedium, time wastage, and inconsistencies within and across orthodontists.
For the detection of anatomical landmarks, the Institute of Electrical and Electronics Engineers (IEEE) International Symposium on Biomedical Imaging (ISBI) 2015 released an open dataset for training and testing of cephalograms. Despite the limited number of annotated cephalograms, many CNN-based approaches have been proposed to solve the problem associated with the detection of anatomical landmarks. In 2017, Arik et al. introduced a framework that employed a CNN to recognize landmark appearance patterns and subsequently combined it with a statistical shape model to refine the optimal positions of all landmarks [6]. To address the restricted availability of medical imaging data for network learning with respect to the localization of anatomical landmarks, Zhang et al. proposed a two-stage task-oriented deep neural network method [7]. In addition, Urschler et al. proposed a framework that incorporated both image appearance information and geometric landmark configuration into a unified random forest framework, which was optimized iteratively to refine joint landmark predictions using a coordinate descent algorithm [8]. Recently, Oh et al. proposed the deep anatomical context feature learning (DACFL) model, which employs a Laplace heatmap regression method based on a fully convolutional network. Its mechanism relies on two main schemes: local feature perturbation (LFP) and anatomical context (AC) loss. LFP can be considered a data augmentation method based on prior anatomical knowledge. It perturbs the local pattern of the cephalogram, forcing the network to seek relevant features globally. AC loss incurs a large cost when the predicted anatomical configuration of the landmarks differs from the ground-truth configuration; the anatomical configuration considers the angles and distances between all landmarks.
Since the proposed system follows an end-to-end learning method, only a single feed-forward execution is required in the test phase to localize all landmarks [2].
Most research to date has focused on head-to-head comparisons between artificial intelligence (AI)-based systems and dentists for the localization of cephalometric landmarks [2][3][4][5][6][9][10][11][12][13][14][15][16]. Previous studies have shown that AI is equivalent or even superior to experienced orthodontists under experimental conditions [11,13]. Rapid developments in AI-based diagnosis have made it imperative to consider the opportunities and risks of new diagnostic paradigms. In fact, pitting humans against AI runs counter to the nature and purpose of AI; AI support for human diagnosis may be more useful and practical. The competitive view of AI is evolving, with studies indicating that a human-AI collaboration approach is more promising. However, the impact of human-AI collaboration on the accuracy of cephalometric landmark detection has not been evaluated to date. This leads to the following question: can a human-AI collaboration perform better than humans or AI alone in cephalometric landmark detection?
Among the previous CNN models, DACFL outperformed other state-of-the-art methods and achieved high performance in landmark identification on the IEEE ISBI 2015 dataset [2]. Therefore, our study aimed to evaluate the effect of DACFL-based support on the clinical skills of beginners in cephalometric diagnosis. Furthermore, we used a private dataset to evaluate the performance of the DACFL model in clinical applications.
In the first step, 1193 images were randomly selected as training data and 100 images were used as test data. The characteristics of the data are listed in Table 1. Data are expressed as mean ± standard deviation (SD) for age and as N (%) for gender and for Class I, Class II division 1, Class II division 2, and Class III.
In the subsequent step, 10 images from the test data were extracted randomly to evaluate the impact of human-AI collaboration on the detection of cephalometric landmarks (Figure 1).

Figure 1. Workflow diagram of the study plan. In step 1, the JBNU dataset, including 1193 images for training and 100 images for testing, was used to evaluate the performance of the DACFL model in clinical applications. In step 2, 10 images were extracted randomly from the JBNU test data to evaluate the effect of DACFL-based support on the clinical skills of beginners in cephalometric diagnosis. Abbreviations: AI, artificial intelligence; DACFL, deep anatomical context feature learning; JBNU, Jeonbuk National University.

Manual Identification of Cephalometric Landmarks
Altogether, 41 landmarks were manually identified by dental residents at the Department of Pediatric Dentistry, Jeonbuk National University Dental Hospital, South Korea (Supplementary Table S1). A modified version of a commercial cephalometric analysis software (V-Ceph version 7, Osstem Implant Co., Ltd., Seoul, Korea) was used to digitize the records of the 41 cephalometric landmarks. This software displayed the cephalograms and obtained the coordinates of each landmark.
In this experiment, 20 final-year students from the School of Dentistry, Jeonbuk National University, South Korea, were selected as beginners. Ten cephalograms, each analyzed twice at a 1-week interval, were used to evaluate the supporting ability of AI. The cephalograms were analyzed at the following two timepoints.
(1) Twenty dental students were educated regarding the definitions of the cephalometric landmarks and the use of the V-Ceph software before the experiment. All students traced the positions of the anatomical landmarks without AI support. After tracing, the ground truth was not provided to the students. (2) After 1 week, all students traced the anatomical landmarks on the 10 randomly rearranged images while consulting the answers provided by the AI model. The students were not informed that the images from the previous experiment were being reused. The AI answers were displayed separately from the actual annotation screen. If a student replaced an answer with the one provided by the AI, the change was recorded.

Network Architecture and Implementation Details
Our architecture was based on the attention U-Net [17]. The contracting and expansive paths consisted of repeated applications of two 3 × 3 convolutions, each followed by LeakyReLU activation and either 2 × 2 max pooling (contracting path) or up-sampling (expansive path) [18]. The number of feature channels increased from 64 to 128, 256, and 512 in the contracting path and decreased from 512 to 256, 128, and 64 in the expansive path. We used the AC loss [2] as the cost function and minimized it using the Adam optimizer. The initial learning rate was 1 × 10⁻⁴, and the learning rate was decayed using a cosine annealing schedule.
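The cosine annealing schedule has a simple closed form. Below is a minimal sketch of the schedule alone (the 1 × 10⁻⁴ initial rate matches the text; the minimum rate of 0 and the 100-epoch horizon are illustrative assumptions, and the function name is ours):

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-4, lr_min=0.0):
    """Cosine annealing: decay the learning rate from lr_max to lr_min
    over total_epochs along half a cosine curve (no warm restarts)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )

# Schedule over an assumed 100-epoch run: starts at 1e-4, decays to 0.
schedule = [cosine_annealing_lr(e, 100) for e in range(101)]
```

In practice, frameworks such as PyTorch provide this schedule built in; the formula above is what such a scheduler computes per epoch.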
Additionally, we performed data augmentation by rotating the input images randomly within [−25, 25] degrees, rescaling within [0.9, 1.1], and translating within [0, 0.05] along both the x-axis and y-axis. We also randomly changed the brightness, contrast, and hue of the input images within the ranges [−1, 1], [−1, 1], and [−0.5, 0.5], respectively; these ranges follow the conventions used by PyTorch. With a probability of 1/10, we applied no augmentation to an input image, so that the deep learning model could also learn from the original images [2].
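As a sketch of this augmentation policy, the following draws one random set of parameters per image, skipping augmentation with probability 1/10 as described above; the actual image warping and color-jitter calls are omitted, and the function name and dictionary keys are ours:

```python
import random

def sample_augmentation(skip_prob=0.1):
    """Draw one random set of augmentation parameters in the ranges
    described in the text. With probability skip_prob (1/10), return
    None so the original, unaugmented image is used instead."""
    if random.random() < skip_prob:
        return None  # keep the original image
    return {
        "rotation_deg": random.uniform(-25, 25),
        "scale": random.uniform(0.9, 1.1),
        "translate": (random.uniform(0, 0.05), random.uniform(0, 0.05)),
        "brightness": random.uniform(-1, 1),
        "contrast": random.uniform(-1, 1),
        "hue": random.uniform(-0.5, 0.5),
    }
```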
We trained and tested the network on an Intel Xeon Gold 6126 2.6 GHz CPU with 64 GB of memory and an RTX 2080 Ti GPU with 11 GB of RAM. The average size of the input images was 2067 × 1675 pixels, and the pixel size differed between images. Therefore, we measured the pixel length of a 50 mm X-ray ruler to determine the physical pixel size of each test image. We resized the input images to 800 × 600 pixels, with a mini-batch size of two, to reduce the computing time. In the test phase, we resized the predictions back to the original input size to obtain correct measurements.
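The ruler-based calibration can be sketched as follows: the imaged 50 mm ruler yields a millimetres-per-pixel scale, which converts pixel-space errors to millimetres (the function names and example numbers are illustrative, not from the study):

```python
import math

def mm_per_pixel(ruler_length_px, ruler_length_mm=50.0):
    """Physical pixel size derived from the imaged 50 mm X-ray ruler."""
    return ruler_length_mm / ruler_length_px

def radial_error_mm(pred_xy, true_xy, scale_mm_per_px):
    """Euclidean distance between predicted and ground-truth landmark
    coordinates (in pixels), converted to millimetres."""
    dx = pred_xy[0] - true_xy[0]
    dy = pred_xy[1] - true_xy[1]
    return math.hypot(dx, dy) * scale_mm_per_px

# A ruler spanning 500 px gives 0.1 mm/px; a 10 px error is then 1.0 mm.
scale = mm_per_pixel(500)
error = radial_error_mm((110, 240), (100, 240), scale)
```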

Evaluation Metrics
We used several measures to evaluate the performance of the landmark-detection model. The position of each landmark was identified by its x- and y-coordinates. The radial error (R) is the Euclidean distance between the predicted and ground-truth coordinates of a landmark. The mean radial error (MRE) and standard deviation (SD) were calculated as follows:

MRE = (Σᵢ Rᵢ) / N

SD = √(Σᵢ (Rᵢ − MRE)² / (N − 1))

In these equations, N denotes the set size.
The successful detection rate (SDR) is an important measure for this problem. The estimated coordinates are considered correct if the error between the estimated coordinates and the correct position is less than a given precision range. The SDR was calculated as follows:

SDR = (number of landmarks successfully detected within z / N) × 100%

In this equation, z denotes the precision range (2, 2.5, 3, or 4 mm). For the classification of anatomical types, the eight clinical measurements set in the IEEE ISBI 2015 challenge were analyzed (Supplementary Table S2) [19,20]. The measurement values and classification results derived by the dental residents were set as the reference values, and the classification results of the AI, beginner, and beginner-AI collaboration groups were compared using the successful classification rate (SCR).
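The metrics above can be computed directly from the per-landmark radial errors; a minimal sketch (the SD uses the N − 1 sample form; the function name and example errors are ours):

```python
import math

def mre_sd_sdr(radial_errors_mm, threshold_mm=2.0):
    """MRE, SD (sample form, N - 1), and SDR at precision range
    z = threshold_mm, from per-landmark radial errors in millimetres."""
    n = len(radial_errors_mm)
    mre = sum(radial_errors_mm) / n
    sd = math.sqrt(sum((r - mre) ** 2 for r in radial_errors_mm) / (n - 1))
    sdr = 100.0 * sum(1 for r in radial_errors_mm if r < threshold_mm) / n
    return mre, sd, sdr

# Example: MRE = 1.5 mm and SDR = 75% within the 2 mm threshold.
mre, sd, sdr = mre_sd_sdr([0.5, 1.0, 1.5, 3.0])
```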

Statistical Analysis
The benefits of the beginner-AI collaboration were analyzed for each landmark. A t-test was applied to compare the average SDRs between the beginner-only and beginner-AI groups within the 2, 2.5, 3, and 4 mm thresholds. All data were analyzed using IBM SPSS Statistics (version 20; IBM Corp., Armonk, NY, USA) and Prism (version 8.0.2; GraphPad Software, Inc., San Diego, CA, USA). Statistical significance was set at a p-value < 0.05.
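For reference, the pooled-variance two-sample t statistic underlying such a comparison can be sketched as follows; in practice, a statistics package (such as the SPSS and Prism software used here) also supplies the degrees of freedom and p-value. The function name and example data are ours:

```python
import math
import statistics

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic for comparing the mean
    SDRs of two groups (e.g. beginner-only vs. beginner-AI)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se
```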

Successful Detection Rate
The model achieved average SDRs of 73.32%, 80.39%, 85.61%, and 91.68% within 2, 2.5, 3, and 4 mm thresholds, respectively. Across all ranges, the sella exhibited the highest SDR, while the glabella exhibited the lowest SDR. In addition, the SDRs of the maxilla 1 root (38%), mandible 6 root (36%), glabella (32%), and soft tissue nasion (38%) were low within the 2 mm threshold (Table 3). Within a 2 mm threshold, the AI, beginner-AI, and beginner-only groups achieved SDRs of 73.17%, 52.73%, and 47.4%, respectively. Furthermore, the average MREs and SDs of the AI, beginner-AI, and beginner-only groups were 1.89 ± 2.63 mm, 3.14 ± 4.06 mm, and 3.72 ± 4.52 mm, respectively. Details are reported in Table 4. Furthermore, a comparison between the beginner-only and beginner-AI groups in terms of the SDR is shown in Figure 2.

Figure 2. Comparison between the beginner-only and beginner-artificial intelligence groups in terms of the successful detection rate. A t-test was applied to compare the average successful detection rates between the beginner-only and beginner-AI groups within 2, 2.5, 3, and 4 mm thresholds.
The beginner-AI collaboration improved the successful detection rates within 2, 2.5, 3, and 4 mm thresholds. Abbreviations: AI, artificial intelligence; ns, not significant.
The DACFL model showed that the lower lip exhibited the lowest MRE (0.62 ± 0.35 mm) and the highest SDR (100%), while the glabella exhibited the highest MRE (5.72 ± 3.54 mm) and the lowest SDR (20%) (Table 5). In the beginner-only group, the mandible 1 crown exhibited the lowest MRE (1.31 ± 2.99 mm) and the highest SDR (93%), while the glabella exhibited the highest MRE (8.9 ± 7.05 mm) and the lowest SDR (20%). In the beginner-AI group, the mandible 1 crown exhibited the lowest MRE (1.23 ± 2.96 mm) and the highest SDR (94%), while the glabella exhibited the highest MRE (7.31 ± 5.42 mm) and the lowest SDR (16%). The benefits of the beginner-AI collaboration in the localization of anatomical landmarks and the number of decision changes among the beginners across the 41 landmarks are presented in Figures 3 and 4.

Successful Classification Rate
The AI, beginner-AI, and beginner-only groups achieved SCRs of 83.75%, 69.69%, and 61.31%, respectively (Table 6). In the AI group, the SNA (100%) and FHA (100%) exhibited the highest SCRs, while the ANB (60%) exhibited the lowest SCR. In the beginner-only group, the MW (81%) exhibited the highest SCR, while the ANB (47%) exhibited the lowest SCR. In the beginner-AI group, the FHA (88%) exhibited the highest SCR, while the ANB (52%) exhibited the lowest SCR. A comparison between the beginner-only and beginner-AI groups in terms of the eight measurement parameters is shown in Figure 5.


Performance of the DACFL Model on the Private Dataset
Most previous studies have tested the accuracy of anatomical landmark detection on IEEE ISBI 2015 lateral cephalograms [2][3][4][5][6][9][15], which ensures high comparability but limits generalizability. Testing on broader data can therefore demonstrate the generalizability and robustness of a model. Among the previous models, the DACFL model showed a high SDR as a state-of-the-art model for cephalometric landmark detection [2]. On the private cephalograms, the model showed a slight reduction in the SDR within a 2 mm threshold. This result was superior or similar to those of previous studies [3,4,12]. In previous studies, an even more dramatic drop in accuracy was observed when models were tested on a fully external dataset [3,4]. Overall, the results for the private dataset were inferior to those for the public dataset with standardized images [2][3][4][5][6][9][15].

In the present study, the private dataset was associated with difficulties in landmark detection in children. These difficulties were probably due to low bone density, variability in the size and shape of anatomical structures, and the presence of primary teeth and permanent tooth germs. In addition to maxillofacial anatomy, patients' head shapes vary. Although we selected a reference cephalogram closely matched to one from the training data for each test, mismatched cases remained. Correct positioning of the patient's head during the procedure is important to avoid errors in the identification and measurement of landmarks [4,21,22], and it is difficult to keep children's heads in standard positions. In addition to the quality of the dataset, the number of images and cephalometric landmarks also influences the results. A previous study showed that the accuracy of AI increased linearly with the number of training datasets and decreased with an increasing number of detection targets [23]. Our study used an insufficiently large number of images and detected 41 anatomical landmarks. The training data should be increased to an optimal number of images, between 2300 and 5400, to improve the performance of landmark detection.
For clinical applications, a mean error within a 2 mm threshold has been suggested to be acceptable in many related studies [2][3][4][5][6][9][10][11][13][14][15][16][24]. Therefore, the MRE in the present study was clinically acceptable. However, when assessing which specific landmarks were prone to incorrect detection, the maxilla 1 root, mandible 6 root, glabella, and soft tissue nasion showed greater deviations. These findings are not consistent with those of previous studies [3,4,12]. This discrepancy can be explained by the fact that the maxilla 1 root was affected by the existence of maxillary anterior tooth germs; the location of the apex was based on general knowledge of the expected taper perceived from the crown and the visible portion of the root. This problem was also encountered in previous research on reliability [10,11,25-28]. Furthermore, the mandible 6 root was affected by overlapping structures, which is a common problem in lateral cephalograms. Dental landmarks usually tend to have poorer validity than skeletal landmarks [10,11,27]. The soft tissue nasion and glabella were located in considerably darker areas. Thus, it was difficult to identify these soft tissue landmarks precisely, even in a magnified view.

Impact of AI-Based Assistance on the Performance of Beginners in Cephalometric Landmark Detection
The AI group had the highest average SDR, followed by the beginner-AI and beginner-only groups. With AI support, the average SDR increased by up to 5.33% within a 2 mm threshold, while the average MRE decreased. The detection of the porion, basion, nasion, articulare, soft tissue A, soft tissue pogonion, and PPOcc improved by more than 10% in terms of the SDR. The remaining landmarks were detected with little or no improvement in the SDR (Figure 2). In general, AI aids beginners in improving landmark detection, as demonstrated by the impact of the beginner-AI collaboration on each landmark (Figure 3). However, the improvement was insignificant, since there were only small changes in the positions of the landmarks (Figure 4).
In addition to the SDR, we calculated the SCR to evaluate the classification performance. The DACFL model showed better results than previous models [6,12,29]. As expected, measurements consisting of landmarks with higher SDRs yielded higher SCR values. The average SCRs of the three groups showed a descending trend similar to that of the average SDRs (highest in the AI group, followed by the beginner-AI and beginner-only groups). With AI support, the average SCR increased by 8.38%, but the increase was not statistically significant. This may be explained by the small increase in the SDRs with AI support. The SCRs for the SNA, SNB, APDI, FHI, and FHA measurements improved by more than 10%, while the SCR showed little improvement for the remaining measurements (Table 6).
In the present study, the beginners were final-year dental students with little experience in the detection of cephalometric landmarks. The precision of landmark identification can be affected by various factors, such as the level of knowledge, individual understanding of the definitions of landmarks, and the quality of cephalometric images [30,31]. Among the soft tissue landmarks, the glabella, soft tissue nasion, columella, soft tissue A, and stms showed low SDRs owing to the darker appearance of these regions. Problems with image quality influenced the ability of dental students who lacked experience in cephalometric landmark detection. In a previous study, dental students showed high variability in landmark identification [32], a finding consistent with the results of the present study (Table 5). Furthermore, inexperienced annotators exhibited lower accuracy in landmark detection than AI on lateral cephalograms, consistent with the results of a previous study involving frontal cephalograms [33].
Our study has several limitations. The private dataset was small and showed limited variation. The patients were children and adolescents, which might have influenced the detection of cephalometric landmarks. Thus, private datasets for adults should be investigated to confirm the performance of the DACFL model. The number of cephalometric landmarks was not sufficiently large to examine the full capability of the model. Moreover, landmark identification was performed by beginners, and a previous study showed that experienced orthodontists exhibit lower variability in landmark detection than dental students. Further studies are necessary to demonstrate the benefits of a collaboration between AI and experienced orthodontists.

Conclusions
Our study showed that the DACFL model achieved an SDR of 73.17% within a 2 mm threshold on a private dataset. Furthermore, the beginner-AI collaboration improved the SDR by 5.33% within a 2 mm threshold and also improved the SCR by 8.38% when compared with beginners. These results suggest that the DACFL model is applicable to clinical orthodontic diagnosis. Further studies should be performed to demonstrate the benefits of a collaboration between AI and experienced orthodontists.

Informed Consent Statement:
The requirement for patient consent was waived since the present study was not a human subject research project specified in the Bioethics and Safety Act. Moreover, it was practically impossible to obtain the consent of the research subjects and the risk to these subjects was extremely low, since existing data were used in this retrospective study.

Data Availability Statement:
The datasets used and/or analyzed during the present study can be obtained from the corresponding author upon reasonable request.